Architecting Data-Intensive Applications

By Anuj Kumar
About this book

Are you an architect or a developer who looks at your own applications gingerly while browsing through Facebook and applauding it silently for its data-intensive, yet fluent and efficient, behaviour? This book is your gateway to building smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.

This book starts by taking you through the primary design challenges involved in architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn't fail under high traffic. This book will show you how you can divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will be able to streamline your thoughts and make the right choice in terms of technologies and architectural principles based on the problem at hand.

Publication date:
July 2018
Publisher
Packt
Pages
340
ISBN
9781786465092

 

Chapter 1. Exploring the Data Ecosystem

"In God we trust. All others must bring data."
— W. Edwards Deming (https://deming.org/theman/overview)

Until a few years ago, a successful organization was one that had access to superior technology, which, in most cases, was either proprietary to the organization or was acquired at great expense. These technologies enabled organizations to define complex business process flows, addressing specific use cases that helped them to generate revenue. In short, technology drove the business. Data did not constitute any part of the decision-making process. With such an approach, organizations could only utilize a part of their data. This resulted in lost opportunities and, in some cases, unsatisfied customers. One of the reasons for these missed opportunities was the fact that there was no reliable and economical way to store such huge quantities of data, especially when organizations didn't yet know how to turn that data into business value. Hardware costs were a prohibitive factor.

Things started to change a few years ago when Google published its paper on the Google File System (GFS) (https://static.googleusercontent.com/media/research.google.com/en/archive/gfs-sosp2003.pdf). Its ideas were picked up by Doug Cutting, who co-created Apache Hadoop, an open source framework whose distributed file system is capable of storing large volumes of data using commodity hardware.

Suddenly, organizations, both big and small, realized its potential and started storing any piece of data in Hadoop that had the potential to turn itself into a source of revenue later. The industry coined a term for such a huge, raw store of data, calling it a data lake.

The Wikipedia definition of a data lake is as follows:

"A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files."

 

In short, a data lake is a collection of various pieces of data that may, or may not, be important to an organization. The key phrase here is natural format. What this means is that the data that is collected is seldom processed prior to being stored. The reasoning behind this is that any processing may potentially lead to a loss of information, which, in turn, may limit the organization's ability to generate new sources of revenue. This does not mean that we do not process the data while in flight. What it does mean is that at least one copy of the data is stored in the manner in which it was received from the external system.

But how do organizations fill this data lake, and what data should be stored there? The answer lies in understanding the data ecosystem that exists today. Understanding where the data originates from, and which data to persist, helps organizations to become data-driven instead of process-driven. This ability helps organizations not only to explore new business opportunities, but also to react more quickly in the face of an ever-changing business landscape.

In this introductory chapter, we will:

  • Understand what we mean by a data ecosystem
  • Examine the characteristics of a data ecosystem, in other words, what constitutes a data ecosystem
  • Discuss some data and information sharing standards and frameworks, such as the traffic light protocol and the information exchange policy framework
  • Explore the 3 V's of the data ecosystem
  • Conclude with a couple of use cases to prove our point

I hope that this chapter will pique your interest and get you excited about exploring the rest of the book.

 

What is a data ecosystem?


An ecosystem is defined as a complex set of relationships between interconnected elements and their environments. For example, the social construct around our daily lives is an ecosystem. We depend on the state to provide us with basic necessities, including food, water, and gas. We rely on our local stores for our daily needs, and so on. Our livelihood is directly or indirectly dependent upon the social construct of our society. The interdependency, as well as the interconnectivity, of these social elements is what defines a society.

 

Along the same lines, a data ecosystem can be defined as a complex set of possibly interconnected data and the environment from which that data originates. Data from social websites, such as Twitter, Facebook, and Instagram; data from connected devices, such as sensors; data from the (Industrial) Internet of Things; SCADA systems; data from your phone; and data from your home router, all constitute a data ecosystem to some extent. As we will see in the following sections, this huge variety of data, when connected, can be really useful in providing insights into previously undiscovered business opportunities. 

A complex set of interconnected data

What this section implies is that data can be a collection of structured, semi-structured, or unstructured data (hence, a complex set). Additionally, data collected from different sources may relate to one another, in some form or other. To put it in perspective, let's look at a very simple use case, where data from different sources can be connected. Imagine you have an online shopping website and you would like to recommend to your visitors the things that they would most probably want to buy. For the recommendation to succeed, you may need a lot of relevant information about the person. You may want to know what a person likes/dislikes, what they have been searching for in the last few days, what they have been tweeting about, and what topics they are discussing in public forums. All these constitute different sources of data and, even though, at first glance, it may appear that the data from individual sources is not connected, the reality is that all the data pertains to one individual, and their likes and dislikes. Establishing such connections in different data sources is key for an organization when it comes to quickly turning an idea into a business opportunity.

Data environment

The environment in which the data originates is as important as the data itself. The environment provides us with the contextual information to attach to the data, which may further help us in making the correct decision. Having contextual information helps us to understand the relevancy as well as the reliability of the data source, which ultimately feeds into the decision-making process. The environment also tells us about the data lineage (to be discussed in detail in Chapter 12, When Data Dissemination Is as Important as Data Itself), which helps us to understand whether the data has been modified during its journey or not and, if it has, how it affects our use case.

Each organization has its own set of data sources that constitute their specific data ecosystem. Remember that one organization's data sources may not be the same as another organization's.

The data evangelist within the organization should always focus on identifying which sources of data are more relevant than others for a given set of use cases that the organization is trying to resolve.

This feeds into our next topic, what constitutes a data ecosystem?

 

What constitutes a data ecosystem?


Nowadays, data comes from a variety of sources, at varying speeds, and in a number of different formats. Understanding data and its relevance is the most important task for any data-driven organization. 

To understand the importance of data, the data czars in an organization should look at all possible sources of data that may be important to them. Being far-sighted helps, although, given the pace of modern society, it is almost impossible to gather data from every relevant source. Hence, it is important that the person/people involved in identifying relevant data sources are also well aware of the business landscape in which they operate. This knowledge will help tremendously in averting problems later. Data source identifiers should also be aware that data can be sourced both inside and outside of an organization, since, at the broadest level, data is first classified as being either internal or external data.

Given the opportunity, internal data should first be converted into information. Handling internal data first helps the organization to understand its importance early in the life cycle, without needing to set up a complex system, thereby making the process agile. In addition, it also gives the technical team an opportunity to understand what technology and architecture would be most appropriate in their situation. Such a distinction also helps organizations to not only put a reliability rating on data, but also to define any security rules in connection with the data.

So, what are the different sources of data that an organization can utilize to its benefit? The landscape is so vast that listing every source would not be possible; social media feeds, connected devices and sensors, machine logs, and enterprise systems such as CRM and ERP make up only a part of it.

These data sources can be categorized as internal or external, depending upon the business segment in which an organization is involved. For example, for an organization such as Facebook, all the social media-related data on its website would constitute an internal source, whereas the same data for an advertising firm would represent an external source of data.

As you may have already noticed, the preceding set of data can broadly be classified into three sub-categories:

Structured data

This type of data contains a well-defined structure that can be parsed easily by any standard machine parser. This type of data usually comes with a schema that defines the structure of the data. For example, incoming data in XML format with an associated XML schema constitutes structured data. Examples of such data include Customer Relationship Management (CRM) data and Enterprise Resource Planning (ERP) data.
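Parsing such data is mechanical because the structure is known in advance. Below is a minimal sketch in Python; the CRM record shown is hypothetical, invented purely for illustration, and real CRM exports differ by product:

import xml.etree.ElementTree as ET

# A hypothetical CRM record, invented for illustration. The principle is the
# same for any structured source: a schema-backed structure that a standard
# parser can consume directly.
crm_xml = """
<customer id="C-1042">
    <name>Jane Doe</name>
    <email>jane.doe@example.com</email>
    <last_purchase amount="79.99" currency="EUR">2018-06-14</last_purchase>
</customer>
"""

root = ET.fromstring(crm_xml)
print(root.get("id"))                            # C-1042
print(root.findtext("name"))                     # Jane Doe
print(root.find("last_purchase").get("amount"))  # 79.99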

 

 

Semi-structured data

Semi-structured data consists of data that does not have a formal schema associated with it. Log data from different machines can be regarded as semi-structured data. For example, a firewall log statement consists of the following fields as a minimum: the timestamp, host IP, destination IP, host port, and destination port, as well as some free text describing the event that triggered the log statement.
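To make this concrete, here is a minimal sketch of extracting the structured fields from such a log line while keeping the free text intact. The exact layout and the sample line are assumptions for illustration; real firewall formats vary by vendor:

import re

# Assumed layout for illustration: timestamp, host IP:port, destination
# IP:port, then free text. Real firewall formats differ by vendor.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+)\s+"
    r"(?P<host_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<host_port>\d+)\s+->\s+"
    r"(?P<dest_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dest_port>\d+)\s+"
    r"(?P<message>.*)"
)

line = "2018-07-01 10:42:13 10.0.0.5:51234 -> 93.184.216.34:443 connection denied by rule 17"
match = LOG_PATTERN.match(line)
if match:
    event = match.groupdict()
    print(event["dest_ip"], event["message"])  # 93.184.216.34 connection denied by rule 17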

Unstructured data

Finally, we have data that is unstructured. When I say unstructured, what I really mean is that, looking at the data, it is hard to derive any structured information directly from the data itself. It does not mean that we can't get information from the unstructured data. Examples of unstructured data include video files, audio files, and blogs, while most of the data generated on social media also falls under the category of unstructured data.

One thing to note about any kind of data is that, more often than not, each piece of data will have metadata associated with it. For example, when we take a picture using our cellphone, the picture itself constitutes the data, whereas its properties, such as when it was taken, where it was taken, what the focal length was, its brightness, and whether it was modified by software such as Adobe Photoshop, constitutes its metadata.
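That metadata is straightforward to inspect programmatically. Here is a minimal sketch using the Pillow imaging library; it assumes Pillow is installed, and photo.jpg is a placeholder for any camera-produced JPEG:

from PIL import Image, ExifTags

# 'photo.jpg' is a placeholder path for any camera-produced JPEG.
image = Image.open("photo.jpg")
exif = image.getexif()

for tag_id, value in exif.items():
    # Map numeric EXIF tag IDs to human-readable names where known.
    tag_name = ExifTags.TAGS.get(tag_id, tag_id)
    print(f"{tag_name}: {value}")  # e.g. DateTime, Model, Software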

Sometimes, it is also difficult to categorize data clearly. Consider, for example, a security firm that sells hardware appliances to its customers; each appliance is installed at the customer's location and collects access log data. The data belongs to the end customer, who has given permission for it to be used for a specific purpose: detecting security threats. Thus, even though the data resides with the security organization, it cannot be used (without consent) for any purpose other than to detect threats for that specific customer.

This brings us to our next topic: data sharing.

 

 

Data sharing

Whenever we collect data from an external source, there is always a clause about how that data can be used. At times, this aspect is implicit, but there are times when you need to provide an explicit mechanism for how the data can be shared by the collecting organization, both within and outside the organization. This distinction becomes important when data is shared between specific organizations. For example, one particular financial institution may decide to share certain information with another financial institution because both are part of a larger consortium that requires them to work collectively towards combating cyber threats. Now, the data on cyber threats that is collected and shared by these organizations may come with certain restrictions. Namely:

  • When should the shared data be used?
  • How may this data be shared with other parties, both within and outside an organization?

There are numerous ways in which such a sharing agreement can be formalized between organizations. Two such mechanisms, defined and used by many organizations, are:

  • The traffic light protocol
  • The information exchange policy framework from first.org

Let's discuss each of these briefly.

Traffic light protocol

The traffic light protocol (hereinafter referred to as TLP, https://www.us-cert.gov/tlp and https://www.first.org/tlp) is a set of designations used to ensure that sensitive information is shared with the appropriate audience. TLP was created to facilitate the increased sharing of information between organizations. It employs four colors to indicate the expected sharing boundaries to be applied by the recipient(s):

  • RED
  • AMBER
  • GREEN
  • WHITE

TLP provides a simple and intuitive schema for indicating when and how sensitive information can be shared, thereby facilitating more frequent and effective collaboration. TLP is not a control marking or classification scheme. TLP was not designed to handle licensing terms, handling and encryption rules, and restrictions on action or instrumentation of information. TLP labels and their definitions are not intended to have any effect on freedom of information or sunshine laws in any jurisdiction.

TLP is optimized for ease of adoption, human readability, and person-to-person sharing; it may be used in automated sharing exchanges, but is not optimized for such use.

The source is responsible for ensuring that recipients of TLP information understand and can follow TLP sharing guidance.

If a recipient needs to share the information more widely than is indicated by the original TLP designation, they must obtain explicit permission from the original source.

The United States Computer Emergency Readiness Team provides the following definition of TLP, along with its usage and sharing guidelines:

TLP:RED
Not for disclosure, restricted to participants only.
When it should be used: Sources may use TLP:RED when information cannot be effectively acted upon by additional parties, and could impact a party's privacy, reputation, or operations if misused.
How it may be shared: Recipients may not share TLP:RED information with any parties outside of the specific exchange, meeting, or conversation in which it was originally disclosed. In the context of a meeting, for example, TLP:RED information is limited to those present at the meeting. In most circumstances, TLP:RED should be exchanged verbally or in person.

TLP:AMBER
Limited disclosure, restricted to participants' organizations.
When it should be used: Sources may use TLP:AMBER when information requires support to be effectively acted upon, yet carries risks to privacy, reputation, or operations if shared outside of the organizations involved.
How it may be shared: Recipients may only share TLP:AMBER information with members of their own organization, and with clients or customers who need to know the information to protect themselves or prevent further harm. Sources are at liberty to specify additional intended limits associated with the sharing: these must be adhered to.

TLP:GREEN
Limited disclosure, restricted to the community.
When it should be used: Sources may use TLP:GREEN when information is useful for making all participating organizations, as well as peers within the broader community or sector, aware.
How it may be shared: Recipients may share TLP:GREEN information with peers and partner organizations within their sector or community, but not via publicly accessible channels. Information in this category can be circulated widely within a particular community. TLP:GREEN information may not be released outside of the community.

TLP:WHITE
Disclosure is not limited.
When it should be used: Sources may use TLP:WHITE when information carries minimal or no foreseeable risk of misuse, in accordance with applicable rules and procedures for public release.
How it may be shared: Subject to standard copyright rules, TLP:WHITE information may be distributed without restriction.
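To make the designations concrete, here is a small sketch of how a recipient system might gate redistribution on a TLP label. The audience categories and their ordering are my own simplification of the table above, not part of the TLP specification:

# Broadening order of audiences, from most to least restricted. This mapping
# is a simplification of the table above, used only for illustration.
TLP_AUDIENCE = {
    "RED":   "participants",    # only those in the original exchange
    "AMBER": "organization",    # own organization plus clients who need to know
    "GREEN": "community",       # peers and partners, no public channels
    "WHITE": "public",          # unrestricted, subject to copyright
}

AUDIENCE_RANK = {"participants": 0, "organization": 1, "community": 2, "public": 3}

def sharing_allowed(tlp_color: str, target_audience: str) -> bool:
    """Return True if information labelled tlp_color may reach target_audience."""
    allowed = TLP_AUDIENCE[tlp_color.upper()]
    return AUDIENCE_RANK[target_audience] <= AUDIENCE_RANK[allowed]

print(sharing_allowed("AMBER", "organization"))  # True
print(sharing_allowed("AMBER", "public"))        # False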

 

Remember that this is guidance, not a rule. Therefore, if an organization feels the need to impose further types of restrictions, it may certainly do so, provided the receiving entity is made aware of those additional restrictions.

 

Information exchange policy


The information exchange policy (IEP) framework (https://www.first.org/iep) was put together by FIRST for Computer Security Incident Response Teams (CSIRTs), security communities, organizations, and vendors, who may consider implementing it to support their information sharing and information exchange initiatives.

The IEP framework is composed of four different policy types: Handling, Action, Sharing, and Licensing (HASL), along with metadata statements that describe the policy itself.

Let's look at each of these briefly.

Handling policy statements

Policy statement: ENCRYPT IN TRANSIT
Type: HANDLING
Description: States whether the information received has to be encrypted when it is retransmitted by the recipient.
Enumerations:
  • MUST: recipients MUST encrypt the information received when it is retransmitted or redistributed.
  • MAY: recipients MAY encrypt the information received when it is retransmitted or redistributed.
Required: NO

Policy statement: ENCRYPT AT REST
Type: HANDLING
Description: States whether the information received has to be encrypted by the recipient when it is stored.
Enumerations:
  • MUST: recipients MUST encrypt the information received when it is stored.
  • MAY: recipients MAY encrypt the information received when it is stored.
Required: NO

 

Action policy statements

Policy statement: PERMITTED ACTIONS
Type: ACTION
Description: States the permitted actions that recipients can take upon receiving information.
Enumerations:
  • NONE: recipients MUST NOT act upon the information received.
  • CONTACT FOR INSTRUCTION: recipients MUST contact the providers before acting upon the information received. An example is where information redacted by the provider could be derived by the recipient and the affected parties identified.
  • INTERNALLY VISIBLE ACTIONS: recipients MAY conduct actions on the information received that are only visible on the recipient's internal networks and systems, and MUST NOT conduct actions that are visible outside of the recipient's networks and systems, or that are visible to third parties.
  • EXTERNALLY VISIBLE INDIRECT ACTIONS: recipients MAY conduct indirect, or passive, actions on the information received that are externally visible and MUST NOT conduct direct, or active, actions.
  • EXTERNALLY VISIBLE DIRECT ACTIONS: recipients MAY conduct direct, or active, actions on the information received that are externally visible.
Required: NO

Policy statement: AFFECTED PARTY NOTIFICATIONS
Type: ACTION
Description: States whether recipients are permitted to notify affected third parties of a potential compromise or threat. Examples include permitting National CSIRTs to send notifications to affected constituents, or a service provider contacting affected customers.
Enumerations:
  • MAY: recipients MAY notify affected parties of a potential compromise or threat.
  • MUST NOT: recipients MUST NOT notify affected parties of potential compromises or threats.
Required: NO

Sharing policy statements

Policy statement: TRAFFIC LIGHT PROTOCOL
Type: SHARING
Description: States the scope within which recipients are permitted to redistribute the information received, as defined by the enumerations. The enumerations "RED", "AMBER", "GREEN", and "WHITE" are to be interpreted as described in the FIRST traffic light protocol policy.
Enumerations:
  • RED: personal, for identified recipients only.
  • AMBER: limited sharing on a need-to-know basis.
  • GREEN: community-wide sharing.
  • WHITE: unlimited sharing.
Required: NO

Policy statement: PROVIDER ATTRIBUTION
Type: SHARING
Description: States whether recipients are required to attribute or anonymize the provider when redistributing the information received.
Enumerations:
  • MAY: recipients MAY attribute the provider when redistributing the information received.
  • MUST: recipients MUST attribute the provider when redistributing the information received.
  • MUST NOT: recipients MUST NOT attribute the provider when redistributing the information received.
Required: NO

Policy statement: OBFUSCATE AFFECTED PARTIES
Type: SHARING
Description: States whether recipients are required to obfuscate or anonymize information that could be used to identify the affected parties before redistributing the information received. Examples include removing affected parties' IP addresses, or removing the affected parties' names but leaving their industry vertical, prior to sending a notification.
Enumerations:
  • MAY: recipients MAY obfuscate information concerning the specific parties affected.
  • MUST: recipients MUST obfuscate information concerning the specific parties affected.
  • MUST NOT: recipients MUST NOT obfuscate information concerning the specific parties affected.
Required: NO

Licensing policy statements

Policy statement: EXTERNAL REFERENCE
Type: LICENSING
Description: This statement can be used to convey a description or reference to any applicable licenses, agreements, or conditions between the producer and receiver, for example, specific terms of use, contractual language, agreement name, or a URL.
Enumerations: There are no EXTERNAL REFERENCE enumerations; this is a free-form text field.
Required: NO

Policy statement: UNMODIFIED RESALE
Type: LICENSING
Description: States whether the recipient MAY or MUST NOT resell the information received unmodified, or in a semantically equivalent format; for example, transposing the information from a .csv file format to a .json file format would be considered semantically equivalent.
Enumerations:
  • MAY: recipients MAY resell the information received.
  • MUST NOT: recipients MUST NOT resell the information received unmodified or in a semantically equivalent format.
Required: NO

 

Metadata policy statements

Policy statement: POLICY ID
Type: METADATA
Description: Provides a unique ID to identify a specific IEP implementation.
Required: YES

Policy statement: POLICY VERSION
Type: METADATA
Description: States the version of the IEP framework that has been used, for instance, 1.0.
Required: NO

Policy statement: POLICY NAME
Type: METADATA
Description: This statement can be used to provide a name for an IEP implementation, for instance, FIRST Mailing List IEP.
Required: NO

Policy statement: POLICY START DATE
Type: METADATA
Description: States the UTC date from when the IEP is effective.
Required: NO

Policy statement: POLICY END DATE
Type: METADATA
Description: States the UTC date that the IEP is effective until.
Required: NO

Policy statement: POLICY REFERENCE
Type: METADATA
Description: This statement can be used to provide a URL reference to the specific IEP implementation.
Required: NO
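Taken together, a concrete IEP policy instance can be thought of as a simple structured document built from these statements. The sketch below is a hypothetical Python encoding that merely mirrors the tables in this chapter; the FIRST IEP specification defines its own serialization, which should be consulted for real implementations:

# Hypothetical encoding of an IEP policy using the statements described above.
# The values chosen here are examples; only POLICY ID is a required statement.
iep_policy = {
    # Metadata statements
    "POLICY ID": "7b3f0e6a-example-iep-id",
    "POLICY VERSION": "1.0",
    "POLICY NAME": "Consortium Threat-Sharing IEP",
    "POLICY START DATE": "2018-07-01T00:00:00Z",
    # Handling
    "ENCRYPT IN TRANSIT": "MUST",
    "ENCRYPT AT REST": "MAY",
    # Action
    "PERMITTED ACTIONS": "INTERNALLY VISIBLE ACTIONS",
    "AFFECTED PARTY NOTIFICATIONS": "MAY",
    # Sharing
    "TRAFFIC LIGHT PROTOCOL": "AMBER",
    "PROVIDER ATTRIBUTION": "MUST NOT",
    "OBFUSCATE AFFECTED PARTIES": "MUST",
    # Licensing
    "UNMODIFIED RESALE": "MUST NOT",
}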

 

It is very important for any organization to understand where it is gathering data from, and what obligations are associated with that data, before using it for internal purposes or sharing it onward. A lack of clear understanding here can lead to breaches of trust, which is never a desirable situation.

Now that data sharing is behind us, let's talk a little bit about the nature of the data itself. Data usually exhibits three characteristics that are essential to understand when designing a data collection system (to be discussed in the next couple of chapters). The industry calls them the 3 V's of data. Let's briefly look at what the 3 V's stand for and why they are important to bear in mind when designing the system.

 

The 3 V's


The 3 V's stand for:

  • Volume
  • Variety
  • Velocity

Volume

Today's world produces petabytes of data emitted by a variety of sources, be it social media, sensors, blockchain, video, audio, or even transactional systems. The volume collected can be huge, depending on the nature of the business; if you are reading this book, it most likely means that you have huge volumes of data and need to understand how to handle them effectively.

Variety

Variety refers to the different data formats. Relational databases, Excel files, or even simple text files are all examples of different data formats. A system should be capable of handling new varieties of data as and when they arrive. Extensibility is the key component for a data-intensive system when it comes to handling varieties of data. Data variety can be broadly classified into three major blocks:

  • Structured: Data that has a well-defined schema associated with it, for example, relational data and XML-formatted data.
  • Semi-structured: Data whose structure can be anticipated but that does not always conform to a set standard. Examples include JSON-formatted data and columnar data.
  • Unstructured: Binary large object (BLOB) data, for example, video and audio.

 

Velocity

Velocity denotes the speed at which data arrives and becomes stale. There was a time when even month-old data was considered fresh. In today's world, where social media has taken the place of traditional information sources and sensors have replaced human log books, we can't even rely on yesterday's data, as it may already be stale. Data moves in near real time and, if not processed properly and in time, may represent a lost opportunity for the business.

Until now, we have only discussed the data ecosystem, what it consists of, what requirements are associated with it in terms of the ability to share, and the types of data you can expect to collect. None of this will make sense unless we associate the data ecosystem and collection with the value drivers associated with that data for an organization.

Broadly speaking, any data that an organization decides to collect or use has two motivations/intentions behind it. Either the organization wants to use it for improving its own system/processes, or it wants to place itself strategically in a situation where it can generate new opportunities for itself.

Better decision-making processes, be they quicker or more proactive, translate directly into revenue for a company.

Improvements in internal capabilities, either via automation or improved business process management, save time and money, thereby giving organizations more opportunities to innovate and, in turn, reducing costs further and opening up new business opportunities.

As you may have already noticed, this is a circle of dependencies; once an organization finds a balance within this circle, the only way for it to go is up.

 

Use cases


Having understood the data ecosystem and its constituent elements, let's finally look at some practical use cases that could lead an organization to start thinking in terms of data rather than processes.

 

Use case 1 – Security

Until a few years ago, the best way to combat external cyber security threats was to create a series of firewalls that were assumed to be impenetrable, thereby providing security to the systems behind them. To combat internal cyber attacks, anti-virus software was considered more than sufficient. This traditional defense gave a sense of security, but it was more an illusion than a reality. Attackers are well versed in hiding in plain sight and, consequently, looking for "known bad" signatures didn't help in combating Advanced Persistent Threats (APTs). As systems developed in complexity, attack patterns also became more sophisticated, with coordinated hacking efforts persisting over long periods and exploiting every aspect of a vulnerable system.

For example, one use case within the security domain is the detection of anomalies within generated machine data, where the data is explored to identify any non-homogeneous event or transaction in a seemingly homogeneous set of events. An example of anomaly detection is when banks perform sophisticated transformations and context association on incoming credit card transactions to identify whether a transaction looks suspicious. Banks do this to prevent fraudsters from defrauding the bank, either directly or indirectly.
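As a toy illustration of the idea, the sketch below flags transactions whose amount deviates sharply from the rest; real fraud detection relies on far richer features and context association, as noted above:

import statistics

def flag_anomalies(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from the mean.
    A toy stand-in for the sophisticated models banks actually use."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [a for a in amounts if stdev and abs(a - mean) / stdev > threshold]

# Mostly routine card transactions with one glaring outlier.
transactions = [23.10, 41.75, 18.99, 35.20, 27.45, 19.99, 3400.00, 30.10]
print(flag_anomalies(transactions))  # [3400.0]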

Organizations responded by creating hunting teams that looked at various data (for example, system logs, network packets, and firewall access logs) with a view to doing the following:

  • Hunting for undetected intrusions/breaches
  • Detecting anomalies and raising alerts in connection with any malicious activity

The main challenges for organizations in terms of creating these hunting teams were the following:

  • The fact that data is scattered throughout the organization's IT landscape
  • Data quality issues and multiple data versioning issues
  • Access and contractual limitations

All these requirements and challenges created the need for a platform that can support various data formats and that is capable of:

  • Long-term data retention
  • Correlating different data sources
  • Providing fast access to correlated data
  • Real-time analysis

 

 

 

Use case 2 – Modem data collection

XYZ is a large Telecom giant that provides modems to its clients for the purpose of high-speed internet access. The company purchases these modems from four different vendors and then distributes them under its brand. It has a good customer base and distributes in the region of 1 million modems across a vast geographic area. This may sound all well and good for the business, but the company receives around 100 complaints daily, by phone, about the modem not working. To handle these customer complaints and provide efficient after-sales service, the company must employ 25 customer engagement staff on a full-time basis. Every call from the customer lasts around five minutes. This results in a total of (5 min * 100 calls) = 500 minutes dedicated to solving modem complaints every day. In addition to this, every third call results in the recall of a modem and sending a replacement to the customer, all at the company's expense.

The company has further identified that almost 90% of the returned modems work properly and, hence, the actual root of the problem is not modems malfunctioning, but rather faulty or incorrect setup.

All told, handling calls and replacing non-faulty modems is costing the company 1 million euros annually.

It has now decided to take a more proactive approach to solving the issue so that it can detect whether the problem is at the modem level or with the actual setup of the modem. To do this, it has planned to collect anonymous data from each modem every second, analyze it against certain baseline conditions, and create alerts if there is a significant deviation from the norm.

Each modem sends around 1 kilobyte of data every second. With one million modems out there, this results in 1 KB * 1,000,000 = 1,000,000 KB = 1 GB/sec.

Thus, in a day, the company needs to collect 1 GB/sec * 60 sec * 60 min * 24 hours = 86.4 TB of data.
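The back-of-the-envelope sizing is easy to verify in a few lines (decimal units, as used above: 1 GB = 1,000,000 KB and 1 TB = 1,000 GB):

# Back-of-the-envelope sizing for the modem telemetry platform (decimal units).
modems = 1_000_000
kb_per_modem_per_sec = 1                                        # ~1 KB/sec per modem

ingest_gb_per_sec = modems * kb_per_modem_per_sec / 1_000_000   # 1 GB = 1,000,000 KB
daily_tb = ingest_gb_per_sec * 60 * 60 * 24 / 1_000             # 1 TB = 1,000 GB

print(ingest_gb_per_sec)  # 1.0 (GB/sec)
print(daily_tb)           # 86.4 (TB/day)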

This is a huge amount of data and, in order to collect it, the company needs a platform that is capable not only of fast ingestion, but also of quick real-time analysis. Thus, it decides to build a platform that can handle such data intensity and volume.

 

 

 

Summary


Just as it is important to capture data for various efficiencies and insights, it is equally important to understand what data an organization does not want. You may think that you need everything, but the truth is that you do not. Understanding what you need is critical to hastening the journey toward becoming a data-driven organization.

This chapter acts as a precursor to what is coming next. It has given you an idea of the vast landscape of data that we are in and has given you a foundation on which we will build further in forthcoming chapters. Stay tuned as the journey has only just started.

About the Author
  • Anuj Kumar

    Anuj Kumar is a senior enterprise architect with FireEye, a cyber security service provider, where he is involved in the architecture, strategy, and design of various systems that deal with huge amounts of data on a regular basis. Anuj has more than 15 years of professional IT industry experience spanning development, design, architecture, management, and strategy. He is an active member of the OASIS Technical Committee on the STIX/TAXII specification. He is a firm believer in agile methodology, modular/(staged) event-driven architecture, an API-first approach, and continuous integration/deployment/delivery.

    Anuj is also the author of Easy Test Framework, a data-driven testing framework used by more than 50 companies.
