We are living in a digital world in which data grows at an exponential rate: every digital device sends data regularly, and that data is continuously being stored. Storing huge amounts of data is no longer a problem, as cheap hard drives let us keep as much as we want. The most important thing we can do with that data, however, is extract the information we need from it. Once we understand our data, we can analyze or visualize it. This data can come from any domain, such as accounting, infrastructure, healthcare, business, medicine, the Internet of Things (IoT), and more, and it can be structured or unstructured. The main challenge for any organization is to first understand the data it is storing, analyze it to get the information it needs, create visualizations and, from these, gain insight into the data in a visual format that is easy to understand and enables people in management roles to make quick decisions.
However, it can be difficult to fetch information from data due to the following reasons:
- Data brings complexity: It is not easy to get to the root cause of an issue. For example, let's say that we want to find out why a city's traffic system behaves badly on certain days of the month. The issue could depend on another set of data that we are not monitoring; in this case, we could get a better understanding by checking the weather report data for the month. We can then try to find correlations between the two datasets and discover a pattern.
- Data comes from different sources: As mentioned already, one dataset can depend on another, and the two can come from different sources. There may be instances where we cannot access all of the interdependent data sources, so it is important to understand and gather data from sources beyond the one you are directly interested in.
- Data is growing at a faster pace: As we move toward a digital era, we are capturing more and more data. As data grows at a quicker pace, it also creates issues in terms of what to keep, how to keep it, and how to process such huge amounts of data to get the relevant information that we need from it.
We can solve these issues using the Elastic Stack: we can push data from different sources into Elasticsearch, and then analyze and visualize it in Kibana. Kibana solves many data analysis problems, as it provides a rich set of features for exploring and playing around with data. In this book, we will cover all of these features, along with their practical implementation.
In this chapter, we will cover the following topics:
- Data analysis and visualization challenges for industries
- Understanding your data for analysis in Kibana
- Limitations with existing tools
- Components of the Elastic Stack
Depending on the industry, the use cases can be very different in terms of data usage. In any given industry, data is used in different ways and for different purposes—whether it's for security analytics or order management. Data comes in various formats and different scales of volumes. In the telecommunications industry, for example, it's very common to see projects about the quality of services where data is taken from 100,000 network devices.
The challenge for these industries is to handle the huge quantities of data and to produce real-time visualizations from which decisions can be taken. Capturing data for an application is usually straightforward, but turning that data into a real-time dashboard is a challenge. For this, Beats and Logstash can be used to push data from different sources, Elasticsearch can store that data, and, finally, Kibana can analyze and visualize it. To summarize, the industry faces the following canonical issues:
- How to handle huge quantities of data as this comes with a lot of complexity
- How to visualize data effectively and in a real-time fashion so that we can get data insights easily
Once this is achieved, we can easily recognize visual patterns in the data and derive the information we need from it without the burden of exploring tons of raw data. So, let me now describe a real scenario that will help you understand the actual challenge of data capture. I will take a simple use case to explain the issues, and will then explain the technologies that can be used to solve them.
If we consider the ways in which we receive huge amounts of data, then you will note that there are many different sources that we can use to get structured or unstructured data. In this digital world, we use many devices that keep on generating and sending data to a central server where the data is then stored. For instance, the applications that we access generate data, the smartphones or smartwatches we use generate data, and even the cab services, railways, and air travel systems we use for transportation all generate data.
A system and its running processes also generate data, so there are many different ways in which data reaches us. We receive this data at regular intervals, and it either accumulates on the physical drive of a computer or, more frequently, sits hidden within data centers where it is hard to fetch and explore. In order to explore and analyze this data, we need to extract (ship) it from its different locations (such as log files, databases, or applications), convert it from an unstructured format into a structured one (transform), and push the transformed data into a central place (store) where we can access it for analysis. This flow of data streaming through the system requires a proper architecture so that it can be shipped, transformed, stored, and accessed in a scalable and distributed way.
End users, driven by the need to process increasingly higher volumes of data while maintaining real-time query responses, have turned away from more traditional, relational database or data warehousing solutions, due to poor scalability or performance. The solution is increasingly found in highly distributed, clustered data stores that can easily be monitored. Let's take the example of application monitoring, which is one of the most common use cases we meet across industries. Each application logs data, sometimes in a centralized way (for example, by using syslog), and sometimes all the logs are spread out across the infrastructure, which makes it hard to have a single point of access to the data stream.
The majority of large organizations don't retain logged data for longer than the duration of a log file rotation (that is, a few hours or even minutes). This means that, by the time an issue has occurred, the data that could provide the answers is lost.
So, when you actually have the data, what do you do? Well, there are different ways to extract the gist of logs. A lot of people start with a simple string pattern search (GREP); essentially, they try to find matching patterns in logs using a regular expression. That might work for a single log file but, if you want to search across different log files, you need to open each day's individual files and apply the regular expression to each.
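As a minimal sketch of this GREP-style approach, the following applies one regular expression across every log file in a directory. The file layout and log format here are invented for illustration:

```python
# A GREP-style search across several log files: the approach described
# above, and also its limitation (every file must be read in full).
import re
from pathlib import Path

ERROR_RE = re.compile(r"ERROR\s+(\S+)")  # capture the word after ERROR

def grep_logs(directory: str, pattern: re.Pattern) -> list[str]:
    """Return every matching line from every *.log file in a directory."""
    matches = []
    for log_file in Path(directory).glob("*.log"):
        for line in log_file.read_text().splitlines():
            if pattern.search(line):
                matches.append(line)
    return matches
```

This works, but it scans every file on every search, which is exactly why it fails to scale to the scenarios discussed next.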
GREP is convenient but, clearly, it doesn't fit our need to react quickly to failure in order to reduce the Mean Time To Recovery (MTTR). Think about it: what if we were talking about a major issue in the purchasing API of an e-commerce website? What if users experience high latency on this page or, worse, can't complete the purchase process? The time you spend trying to recover your application from gigabytes of logs is money you could be losing. Another potential issue is a lack of security analytics, leaving you unable to blacklist the IPs that try to brute-force your application.
In the same context, I've seen cases where people didn't know that, every night, a group of IPs was attempting to get into their system, simply because they were not able to visualize the IPs on a map and trigger alerts based on that activity. A simple yet very effective pattern for protecting the system would have been to limit access to resources or services to the internal network only; the ability to whitelist access to a known set of IP addresses is essential. The consequences can be dramatic if a proper data-driven architecture with a solid visualization layer is not serving those needs: a lack of visibility and control, an increased MTTR, customer dissatisfaction, financial impact, security leaks, and poor response times and user experiences.
Here, we will discuss the different aspects of data analysis: data shipping, data ingestion, data storage, and data visualization. Each is a very important part of analysis and visualization, and we need to understand them in detail. The objective is to build an architecture that serves each of the following aspects.
A data-shipping architecture should support any sort of data or event transport, whether structured or unstructured. The primary goal of data shipping is to send data from remote machines to a centralized location to make it available for further exploration. For data shipping, we generally deploy lightweight agents that sit on the same servers as the data we want to collect; these shippers fetch the data and keep sending it to the centralized server. We need to consider the following:
- The agents should be lightweight. They should not compete for resources with the process that generates the actual data, in order to minimize the performance impact and leave as small a footprint as possible.
- There are a lot of data shipping technologies out there; some are tied to a specific technology, while others are based on an extensible framework that can adapt to almost any data source.
- Shipping data is not only about sending data over the wire; in fact, it's also about security and making sure that the data is sent to the proper destination with an end-to-end secured pipeline.
- Another aspect of data shipping is the management of data loads. Shipping data should be done relative to the load that the end destination is able to ingest; this feature is called back pressure management.
It's essential for data visualization to rely on reliable data shipping. As an example, consider data flowing from financial trade machines and how critical it could be not to be able to detect a security leak just because you are losing data.
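The back pressure management mentioned above can be sketched with a toy example. The class and method names below are illustrative, not a real Beats API:

```python
# A toy sketch of back pressure management: the shipper blocks when the
# destination cannot keep up, instead of dropping events or exhausting
# memory. All names here are invented for illustration.
import queue

class BoundedShipper:
    def __init__(self, max_in_flight: int = 100):
        # A bounded queue models the load the destination can absorb.
        self._buffer = queue.Queue(maxsize=max_in_flight)

    def ship(self, event: dict) -> None:
        # put() blocks once the buffer is full, so the producer is
        # slowed down to the pace of the consumer (back pressure).
        self._buffer.put(event)

    def drain(self) -> list[dict]:
        # The destination consumes a batch, freeing capacity.
        batch = []
        while not self._buffer.empty():
            batch.append(self._buffer.get())
        return batch
```

The key design choice is the bounded buffer: the alternative (an unbounded buffer) hides overload until the shipper itself runs out of memory and starts losing data.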
The scope of an ingestion layer is to receive data, encompassing as wide a range of commonly used transport protocols and data formats as possible, while providing capabilities to extract and transform this data before finally storing it.
Processing data can be seen as extracting, transforming, and loading (ETL) data; this is often called an ingestion pipeline, which essentially receives data from the shipping layer and pushes it to a storage layer. It comes with the following features:
- Generally, the ingestion layer has a pluggable architecture that eases integration with various data sources and destinations with the help of a set of plugins. Some plugins receive data from shippers, but data does not always come from shippers; it can also come directly from a data source such as a file, a network, or even a database. This can be ambiguous in some cases: should I use a shipper or a pipeline to ingest data from a file? It will, of course, depend on the use case and the expected SLAs.
- The ingestion layer should be used to prepare the data by, for example, parsing it, formatting it, correlating it with other data sources, and normalizing and enriching it before storage. This has many advantages, the most important of which is that it improves the quality of the data, providing better insights for visualization. Another advantage is removing processing overheads later on, by precomputing a value or looking up a reference. The drawback is that you may need to ingest the data again if it was not properly formatted or enriched for visualization; fortunately, there are also ways to process the data after it has been ingested.
- Ingesting and transforming data consumes compute resources. It is essential to take this into account, usually in terms of maximum data throughput per unit of time, and to plan for ingestion by distributing the load over multiple ingestion instances. This is a very important aspect of real-time visualization or, to be precise, near real-time visualization: if ingestion is spread across multiple instances, it can accelerate the storage of the data and, therefore, make it available faster for visualization.
Storage is undoubtedly the masterpiece of a data-driven architecture. It provides essential, long-term retention of your data, as well as the core functionality to search, analyze, and discover insights in it. It is the heart of the process, and what is possible will depend on the nature of the technology. Here are some aspects that the storage layer usually brings:
- Scalability is the main aspect: the storage must handle volumes of data ranging from gigabytes to terabytes to petabytes. The scalability is horizontal, which means that, as demand and volume grow, you should be able to increase the capacity of the storage seamlessly by adding more machines.
- Most of the time, a non-relational and highly distributed data store, which allows fast data access and analysis at a high volume and on a variety of data types, is used, namely, a NoSQL data store. Data is partitioned and spread over a set of machines in order to balance the load while reading or writing data.
- For data visualization, it's essential that the storage exposes an API to enable analysis on top of the data. Letting the visualization layer do the statistical analysis, such as grouping data over a given dimension (aggregation), wouldn't scale.
- The nature of the API can depend on the expectation of the visualization layer, but most of the time it's about aggregations. The visualization should only render the result of the heavy lifting done at the storage level.
- A data-driven architecture can serve data to a lot of different applications and users, and for different levels of SLAs. High availability becomes the norm in such architectures, and, like scalability, it should be part of the nature of the solution.
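To illustrate the analysis API point above, here is the shape of an Elasticsearch-style aggregation request that pushes the heavy lifting to the storage layer. The field names (`service.keyword`, `latency_ms`) are assumptions for illustration:

```python
# The request body asks the data store, not the visualization layer,
# to group documents over a dimension and compute a metric per bucket.
import json

search_request = {
    "size": 0,                                   # only the aggregation result
    "aggs": {
        "per_service": {
            "terms": {"field": "service.keyword"},   # group by service
            "aggs": {
                "avg_latency": {"avg": {"field": "latency_ms"}}
            }
        }
    }
}

# The visualization layer only renders this small, pre-aggregated answer.
print(json.dumps(search_request, indent=2))
```

The `"size": 0` line is the whole point: no raw documents travel to the client, only the aggregated result.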
The visualization layer is the window on the data. It provides a set of tools to build live graphs and charts to bring the data to life, allowing you to build rich, insightful dashboards that answer the questions: What is happening now? Is my business healthy? What is the mood of the market?
The visualization layer in a data-driven architecture is a layer where we expect the majority of the data consumption and is mostly focused on bringing KPIs on top of stored data. It comes with the following essential features:
- It should be lightweight and only render the result of the processing done in the storage layer
- It allows the user to discover the data and get quick out-of-the-box insights on the data
- It offers a visual way to ask unexpected questions of the data, rather than having to implement the proper request to do so
- In modern data architectures that must address the needs of accessing KPIs as fast as possible, the visualization layer should render the data in near real time
- The visualization framework should be extensible and allow users to customize the existing assets or to add new features depending on the need
- The user should be able to share the dashboards outside of the visualization application
As you can see, it's not only a matter of visualization. You need some foundations to reach the objectives. This is how we'll address the use of Kibana in this book: we'll focus on use cases and see what is the best way to leverage Kibana features, depending on the use case and context.
The main differentiator from other visualization tools is that Kibana comes alongside a full stack, the Elastic Stack, with seamless integration with every layer of the stack, which eases the deployment of such an architecture. There are a lot of other technologies out there; we'll now explore what they are good at and what their limits are.
In this section, we will analyze why some technologies have limitations and cannot support an end-to-end solution for a given problem when we try to fulfill the expectations of a data-driven architecture. In these situations, we either combine a set of tools to fulfill the requirement, or we compromise on the requirement based on the features a given technology offers. So, let's now discuss some of the available technologies.
Relational databases are popular and important tools that people use to store their data in the context of a data-driven architecture; for example, we can save the application monitoring logs in a database such as MySQL that can later be used to monitor the application. But when it comes to data visualization, it starts to break all the essential features we mentioned earlier:
- A Relational Database Management System (RDBMS) only manages fixed schemas and is not designed to deal with dynamic data models and unstructured data. Any structural changes made on the data will require updating the schema/tables, which, as everybody knows, is expensive.
- An RDBMS doesn't allow real-time data access at scale. It wouldn't be realistic, for example, to create an index for each column of each table in each schema of an RDBMS; however, that is essentially what would be required for real-time access.
- Scalability is not the easiest thing for RDBMSes; it can be a complex and heavy process to put in place and wouldn't scale against a data explosion.
RDBMSes are better used as a data source consulted at ingestion time in order to correlate or enrich the ingested data, giving better granularity in the visualized data. Visualization is about providing users with the flexibility to create multiple views of the data, enabling them to explore and ask their own questions without predefining a schema or constructing a view in the storage layer.
The Hadoop ecosystem is pretty rich in terms of projects. It's often hard to pick or understand which project will fit our requirements; if we step back, we can consider the following aspects that Hadoop fulfills:
- It suits massive-scale data architectures and will help to store and process any kind of data, at any level of volume
- It has out-of-the-box batch and streaming technologies that will help to process the data as it comes in to create an iterative view on top of the raw data, or enable longer processing for larger-scale views
- The underlying architecture is made to make the integration of processing engines easy, so you can plug and process your data with a lot of different frameworks
- It's made to implement the data lake paradigms where you can essentially drop in data in order to process it
But what about visualization? Well, there are tons of initiatives out there, but the problem is that none of them can go against the real nature of Hadoop, which doesn't help for real-time data visualization at scale:
- The Hadoop Distributed File System (HDFS) is a sequential read-and-write filesystem that doesn't help for random access.
- Even the interactive ad hoc query or the existing real-time APIs don't scale in terms of integration with visualization applications. Most of the time, the user has to export their data out of Hadoop in order to visualize it; some visualization tools claim to have transparent integration with HDFS whereas, under the hood, the data is exported and loaded into memory in batches, which makes the user experience heavy and slow.
- Data visualization is all about APIs and easy access to the data, which Hadoop is not good at, as it always requires implementation from the user.
Hadoop is good for processing data, and is often used in conjunction with real-time technologies, such as the Elastic Stack, to build Lambda architectures, as shown in the following diagram:
In this architecture, you can see that Hadoop aggregates incoming data either in a long processing zone or a near real-time zone. Finally, the results are indexed in Elasticsearch in order to be visualized in Kibana. Essentially, this means that one technology is not meant to replace the other, but that you can leverage the best of both.
There are many different, very performant, and massively scalable NoSQL technologies out there, such as key-value stores, document stores, and columnar stores; however, most of them do not serve analytic APIs or come with an out-of-the-box visualization application.
In most cases, the data that these technologies are using is ingested in an indexation engine, such as Elasticsearch, to provide analytics capabilities for visualization or search purposes.
With the fundamental layers that a data-driven architecture should have and the limits identified in existing technologies in the market, let's now introduce the Elastic Stack, which essentially answers these shortcomings.
The Elastic Stack, formerly called ELK, provides the different layers that are needed to implement a data-driven architecture.
It starts from the ingestion layer with Beats, Logstash, and the ES-Hadoop connector, then to a distributed data store with Elasticsearch, and, finally, to the visualization layer with Kibana, as shown in the following diagram:
As we can see in the preceding diagram, Kibana is just one component of it.
In the following chapters, we'll focus in detail on how to use Kibana in different contexts, but we'll always need the other components. That's why you will need to understand the roles of each of them in this chapter.
One other important thing to note is that this book intends to describe how to use Kibana 7.0; therefore, in this book, we'll use Elastic Stack 7.0.0 (https://www.elastic.co/blog/elastic-stack-7-0-0-released).
Elasticsearch is a distributed and scalable data store from which Kibana will pull out all the aggregation results that are used in the visualization. It's resilient by nature and is designed to scale out, which means that nodes can be added to an Elasticsearch cluster depending on the needs, in a very simple way.
Elasticsearch is a highly available technology, which means the following:
- First, data is replicated across the cluster so that if there is a failure, then there is still at least one copy of the data
- Secondly, thanks to its distributed nature, Elasticsearch can spread the indexing and searching load over the cluster nodes to ensure service continuity and respect your SLAs
It can deal with structured and unstructured data and, as we visualize data in Kibana, you will notice that data (or documents, to use Elastic vocabulary) is indexed in the form of JSON documents. JSON makes it very handy to deal with complex data structures, as it supports nested documents, arrays, and more.
Elasticsearch is a developer-friendly solution and offers a large set of REST APIs to interact with the data, or the settings of the cluster itself. The documentation for these APIs can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html.
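As a hedged illustration of these REST APIs, the following standard-library Python snippet builds a search request against a local cluster. The index name (`logs`) and the query are assumptions, and the actual call is commented out because it requires a running Elasticsearch at `localhost:9200`:

```python
# Building a request for the Elasticsearch _search REST endpoint using
# only the Python standard library. Index name and query are invented.
import json
import urllib.request

body = {"query": {"match": {"message": "error"}}}

req = urllib.request.Request(
    "http://localhost:9200/logs/_search",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With a body attached, urllib issues a POST, which _search accepts.
# response = urllib.request.urlopen(req)       # needs a running cluster
# hits = json.load(response)["hits"]["hits"]
print(req.get_method(), req.get_full_url())
```

In practice you would use one of the official client libraries mentioned below rather than raw HTTP, but the shape of the interaction is the same: a JSON body sent to a well-defined endpoint.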
The parts that will be interesting for this book are mainly aggregations and graphs, which will be used to make analytics on top of the indexed data (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html) and create relationships between documents (https://www.elastic.co/guide/en/graph/current/graph-api-rest.html).
On top of these APIs, there are also client APIs that allow Elasticsearch to be integrated with most technologies, such as Java, Python, Go, and more (https://www.elastic.co/guide/en/elasticsearch/client/index.html).
Kibana generates the requests made to the cluster for each visualization. We'll examine, in this book, how to dig into it and what features and APIs have been used.
A final key aspect of Elasticsearch is that it's a real-time technology that can work with all ranges of volume, from gigabytes to petabytes, through these different APIs.
Besides Kibana, there are a lot of different solutions that can leverage the open APIs that Elasticsearch offers to build visualization on top of the data; however, Kibana is the only technology that is dedicated to it.
Beats is a lightweight data shipper that transports data from different sources such as applications, machines, or networks. We can install and configure Beats on any server to start receiving data. The following diagram shows how we can get data from different servers:
In the preceding diagram, we can see that Filebeat, Metricbeat, and Packetbeat send data to Elasticsearch, and this is then sent to Kibana for analysis or visualization. Note that we can also send data to Logstash from Beats if we want any sort of transformation of the data before sending it to Elasticsearch. Beats is built on top of libbeat, which is an open source library that allows every flavor of Beats to send data to Elasticsearch, as illustrated in the following diagram:
The preceding diagram shows the following Beats:
- Packetbeat: This essentially sniffs packets over the network wire for specific protocols such as MySQL and HTTP. It grabs all the fundamental metrics that will be used to monitor the protocol in question. For example, in the case of HTTP, it will get the request, the response, and then wrap it into a document, and index it into Elasticsearch. We'll not use this Beat in the book, as it would require a full book on it, so I encourage you to go on the following website to see what kind of Kibana dashboard you can build on top of it: http://demo.elastic.co.
- Filebeat: This is meant to securely transport the content of a file from point A to point B, such as the tail command. We'll use this Beat jointly with the new ingest node (https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) to push the data from a file directly to Elasticsearch, which will process the data before indexing it. The architecture can then be simplified, as shown in the following diagram:
- In the preceding diagram, the data is first shipped by Beats, put into a message broker (we'll come back to this notion later in the book), processed by Logstash, and then indexed by Elasticsearch. The ingest node dramatically simplifies the architecture for the use case:
As the preceding diagrams show, the architecture is reduced to two components, with Filebeat and the ingest node. Following this, we'll then be able to visualize the content in Kibana.
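As a hedged illustration of that simplified Filebeat-to-ingest-node setup, a minimal `filebeat.yml` might look like the following; the log paths and the pipeline name are assumptions:

```yaml
# Illustrative filebeat.yml: tail application logs and send events
# straight to an Elasticsearch ingest pipeline (no Logstash in between).
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  pipeline: "myapp-logs"   # ingest node pipeline that parses each event
```

The `pipeline` setting is what delegates the parsing work to the ingest node, giving the two-component architecture shown in the diagram.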
- Metricbeat: This supersedes the earlier Topbeat and allows us to ship machine or application execution metrics to Elasticsearch. We'll also use it later on in this book to ship our laptop's data and visualize it in Kibana. The good thing here is that this Beat comes with prebuilt templates that only need to be imported into Kibana, as the documents it generates are standard.
- There are a lot of different Beats made by the community that can be used for creating interesting data visualizations. A list of them can be found at https://www.elastic.co/guide/en/beats/libbeat/current/index.html.
While these Beats offer some basic filtering features, they don't provide the level of transformation that Logstash can bring.
Logstash is a data processor that embraces the centralized data processing paradigm. It allows users to collect, enrich/transform, and transport data to other destinations with the help of more than 200 plugins, as shown in the following diagram:
Logstash is capable of collecting data from any source, including Beats, as every Beat comes with out-of-the-box integration for Logstash. The separation of roles is clear here: while Beats is responsible for shipping the data, Logstash allows for processing the data before it is indexed. From a data visualization point of view, Logstash should be used to prepare the data; for example, later in this book we will see that you might receive IP addresses in your logs, from which it can be useful to deduce a geolocation.
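As an illustration of that geolocation scenario, a minimal Logstash pipeline using the geoip filter might look like this; the field name `clientip` and the index pattern are assumptions:

```conf
# Illustrative Logstash pipeline: receive events from Beats, deduce a
# geolocation from an IP field, and index into Elasticsearch.
input {
  beats {
    port => 5044
  }
}

filter {
  geoip {
    source => "clientip"   # adds geo fields derived from the IP address
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs-%{+YYYY.MM.dd}"
  }
}
```

The enriched geo fields are what later make it possible to plot those IPs on a Kibana map, as in the security use case described earlier.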
Kibana is the core product described in this book; it's where all the user interface action happens. Most visualization technologies handle the analytical processing themselves, whereas Kibana is just a web application that renders the analytical processing done by Elasticsearch. It doesn't load data from Elasticsearch and process it; instead, it leverages the power of Elasticsearch to do all the heavy lifting. This is what allows real-time visualization at scale: as the data grows, the Elasticsearch cluster is scaled accordingly, to offer the best latency depending on the SLAs.
Kibana provides visual power to the Elasticsearch aggregations, allowing you to slice through your time series datasets, or segment your data fields, as easy as pie.
Kibana is well suited to time-based visualization, even though your data can also come without any timestamp, and it brings visualizations designed to render the results of the Elasticsearch aggregation framework. The following screenshot shows an example of a dashboard built in Kibana:
As you can see, a dashboard contains one or more visualizations. We'll dig into them one by one in the context of our use cases. To build a dashboard, the user is brought into a data exploration experience where they can do the following:
- Discover data by digging into the indexed document, as shown in the following screenshot:
- Build visualizations with the help of a comprehensive palette, based on the question the user has for the data:
The preceding visualization shows the vehicles involved in an accident in Paris. In this example, the first vehicle was a Car, the second was a Motor Scooter, and the third was a Van. We will dig into the accidentology dataset in the logging use case.
- Build an analytics experience by composing the different visualizations in a dashboard.
Finally, I would like to introduce the concept of X-Pack, which will also be used in the book. While X-Pack is part of the subscription offer, you can download it on the Elastic website and use a trial license to evaluate it.
X-Pack is a set of plugins for Elasticsearch and Kibana that provides the following enterprise features.
Security helps to secure the architecture at both the data and access level. On the access side, the Elasticsearch cluster can be integrated with LDAP, Active Directory, or PKI to enable role-based access to the cluster. There are additional ways to grant access, either through what we call a native realm, which is local to the cluster, or a custom realm, for integrating with other sources of authentication (https://www.elastic.co/guide/en/x-pack/current/security-getting-started.html).
By adding role-based access to the cluster, users will only see the data that they are allowed to see, at the index, document, and field level. From a data visualization point of view, this means, for example, that if two sets of users share data within the same index, but the first set is only allowed to see French data and the other set only German data, both could have a Kibana instance pointing at the index, while the underlying permissions configuration renders only the respective country's data.
On the data side, the transport layer between Elasticsearch nodes can be encrypted. Transport can also be secured at the Elasticsearch and Kibana level, which means that the Kibana URL can sit behind HTTPS. Finally, the security plugin provides IP filtering and, more importantly for data visualization, audit logging that tracks all access to the cluster and can easily be rendered as a Kibana dashboard.
Monitoring is a Kibana plugin that gives insights into the infrastructure. While this was made primarily for Elasticsearch, Elastic is extending it for other parts of the architecture, such as Kibana or Logstash. That way, users will have a single point of monitoring of all Elastic components and can track, for example, whether Kibana is executing properly, as shown in the following screenshot:
As you can see, users are able to see how many concurrent connections are made on Kibana, as well as deeper metrics such as the event loop delay, which essentially represents the performance of the Kibana instance.
If alerting is combined with monitoring data, it enables proactive monitoring of both your Elastic infrastructure and your data. The alerting framework lets you describe a query and schedule an action in the background, defining the following:
- When you want to run the alert; in other words, to schedule the execution of the alert
- What you want to watch by setting a condition that leverages Elasticsearch search, aggregations, and graph APIs
- What you want to do when the watch is triggered; that is, write the result to a file or an index, send it by email, or send it over HTTP
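Putting those three parts together, a watch definition (sent with `PUT _watcher/watch/<id>`) has the following general shape; the index pattern, threshold, and email address below are invented for illustration:

```json
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": { "query": { "match": { "level": "ERROR" } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 } }
  },
  "actions": {
    "notify_ops": {
      "email": { "to": "ops@example.com", "subject": "Error spike detected" }
    }
  }
}
```

The `trigger`, `input`/`condition`, and `actions` sections map one to one onto the three bullet points above: when to run, what to watch, and what to do.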
The watch states are also indexed in Elasticsearch, which allows visualization to see the life cycle of the watch. Typically, you would render a graph that monitors certain metrics and the related triggered watches, as shown in the following diagram:
The important aspect of the preceding visualization is that the user can see when the alerts have been triggered and how many of them have been triggered, depending on a threshold. We will use alerting later in this book in a metric analytics use case based on the performance of the CPU.
Reporting is a plugin, introduced with X-Pack 2.x, that allows users to export a Kibana dashboard as a PDF. This is one of the most eagerly awaited features among Kibana users, and it is as simple as clicking a button, as illustrated in the following screenshot:
The PDF generation is put into a queue and users can follow the export process, and then download the PDF.
At this point, you should have a clear view of the different components that are required to build up a data-driven architecture. We have also examined how the Elastic Stack fits this need, and have learned how Kibana requires the other components of the stack in order to ship, transform, and store the visualized data. We have also covered how Beats can be quite handy to get data from different servers without putting an extra burden on the servers.
In the next chapter, we'll demonstrate how to get started with Kibana and install all the components you need to see your first dashboard.