You're reading from  Learning Elastic Stack 6.0

Product type: Book
Published in: Dec 2017
Publisher: Packt
ISBN-13: 9781787281868
Edition: 1st Edition
Authors (2):
Pranav Shukla

Pranav Shukla is the founder and CEO of Valens DataLabs, a technologist, husband, and father of two. He is a big data architect and software craftsman who uses JVM-based languages. Pranav has diverse experience of over 14 years in architecting enterprise applications for Fortune 500 companies and start-ups. His core expertise lies in building JVM-based, scalable, reactive, and data-driven applications using Java/Scala, the Hadoop ecosystem, Apache Spark, and NoSQL databases. He is a big data engineering, analytics, and machine learning enthusiast.

Sharath Kumar M N

Sharath Kumar M N did his master's in computer science at the University of Texas, Dallas, USA. He is currently working as a senior principal architect at Broadcom. Prior to this, he was working as an Elasticsearch solutions architect at Oracle. He has given several tech talks at conferences such as Oracle Code events. Sharath is a certified trainer (Elastic Certified Instructor), one of the few technology experts in the world who has been certified by Elastic Inc. to deliver their official "from the creators of Elastic" training. He is also a data science and machine learning enthusiast. In his free time, he likes playing with his lovely niece, Monisha; nephew, Chirayu; and his pet, Milo.

Chapter 4. Analytics with Elasticsearch

On our journey of learning about Elastic Stack 6.0, we have gained a strong understanding of Elasticsearch. In the previous two chapters, we learned about its foundations and gained an in-depth understanding of its search use cases.

The underlying technology, Apache Lucene, was originally developed for text search use cases. Due to innovations in Apache Lucene and additional innovations in Elasticsearch itself, Elasticsearch has also emerged as a very powerful analytics engine. In this chapter, we will understand how Elasticsearch can serve as your analytics engine. We will look at the following:

  • The basics of aggregations
  • Preparing data for analysis
  • Metric aggregations
  • Bucket aggregations
  • Pipeline aggregations

We will learn all of this by using a real-world dataset. Let us start by understanding the basics of aggregations.

The basics of aggregations


In contrast to search, analytics deals with the bigger picture. Searching addresses the need for zooming in to a few records; analytics addresses the need for zooming out and slicing the data in different ways. While learning about searching, we used the API of the following form:

POST /<index_name>/<type_name>/_search
{
  "query": {
    ... type of query ...
  }
}

All aggregation queries share a common structure. The aggregations element (aggs for short) allows us to aggregate data, and every aggregation request takes the following form:

POST /<index_name>/<type_name>/_search
{
  "aggs": {
    ... type of aggregation ...
  },
  "query": { ... type of query ... },   // optional query part
  "size": 0                             // size typically set to 0
}

The aggs element should contain the actual aggregation query. The body depends on the type of aggregation that...
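As a rough illustration of this structure, the request body can also be assembled programmatically. The following Python sketch is not from the book: the helper name build_aggs_request, the aggregation name total_usage, and the field name usage are hypothetical and chosen only for illustration. It only builds and prints the JSON body; sending it to the _search endpoint is left out.

```python
import json


def build_aggs_request(aggs, query=None, size=0):
    """Assemble a search request body that carries an aggregation.

    aggs  -- dict mapping an aggregation name to its definition
    query -- optional query that narrows the documents being aggregated
    size  -- number of search hits to return; 0 returns only aggregation results
    """
    body = {"aggs": aggs, "size": size}
    if query is not None:
        body["query"] = query
    return body


# Hypothetical example: sum a numeric field named "usage" over all documents.
request_body = build_aggs_request(
    aggs={"total_usage": {"sum": {"field": "usage"}}}
)
print(json.dumps(request_body, indent=2))
```

Leaving the query out aggregates over every document in the index, while setting size to 0 keeps the response free of individual search hits.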

Preparing data for analysis


We will consider an example of network traffic data generated from Wi-Fi routers. Throughout this chapter, we will analyze the data from this example. It is important to understand what the records in the underlying system look like and what they represent. We will cover the following topics while we prepare and load the data into the local Elasticsearch instance:

  • Understanding the structure of data
  • Loading the data using Logstash

Understanding the structure of data

The following diagram depicts the design of the system, to help you gain a better understanding of the problem and the structure of data collected:

Fig 4.1 Network traffic and bandwidth usage data for Wi-Fi traffic and storage in Elasticsearch

The data is collected by the system with the following objectives:

  • In the left half of the figure, there are multiple squares representing one customer's premises, with the Wi-Fi routers deployed on that site, along with all devices connected to those Wi-Fi routers...

Metric aggregations


Metric aggregations work with numeric data, computing one or more aggregate metrics within the given context. The context could be a query, a filter, or no query at all, in which case the whole index/type is included. Metric aggregations can also be nested inside bucket aggregations. In this case, the metrics are computed for each bucket of the enclosing bucket aggregation.

We will start with simple metric aggregations without nesting them inside bucket aggregations. When we learn about bucket aggregations later in the chapter, we will also learn how to use metric aggregations inside bucket aggregations.
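To make the idea of a metric concrete, here is a minimal Python sketch of what the simplest metric aggregations compute over the values of one numeric field. The sample values, the field name usage, and the helper metric_request are assumptions made for illustration and do not come from the chapter's dataset.

```python
# Hypothetical sample of a numeric field (for example, bytes used per record).
usage_values = [120, 300, 80, 500]

# What the four simple metric aggregations compute over the matching documents.
metrics = {
    "sum": sum(usage_values),
    "avg": sum(usage_values) / len(usage_values),
    "min": min(usage_values),
    "max": max(usage_values),
}
print(metrics)


def metric_request(agg_type, field):
    """Build a request body for a single-field metric aggregation (hypothetical helper)."""
    return {"aggs": {f"{agg_type}_{field}": {agg_type: {"field": field}}}, "size": 0}


# The request bodies for sum, avg, min, and max differ only in the aggregation type.
avg_request = metric_request("avg", "usage")
print(avg_request)
```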

We will learn about the following metric aggregations:

  • Sum, average, min, and max aggregations
  • Stats and extended stats aggregations
  • Cardinality aggregation

Let us learn about them one by one.

Sum, average, min, and max aggregations

Finding the sum of a field, the minimum value for a field, the maximum value for a field, or an average, are very common operations. For the people who are familiar with...

Bucket aggregations


Bucket aggregations are useful to analyze how the whole relates to its parts to gain better insight. They help in segmenting the data into smaller parts. Each type of bucket aggregation slices the data into different segments or buckets. Bucket aggregations are the most common type of aggregation used in any analysis process.
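The following Python sketch illustrates the bucketing idea on a handful of in-memory records, and then shows a request body that expresses the same segmentation with a terms bucket aggregation and a nested sum metric. The records, the field names department and usage, and the aggregation names are hypothetical; this is a simplified model of the concept, not how Elasticsearch executes it internally.

```python
from collections import defaultdict

# Hypothetical records; the field names "department" and "usage" are illustrative.
records = [
    {"department": "Sales", "usage": 100},
    {"department": "Engineering", "usage": 400},
    {"department": "Sales", "usage": 50},
    {"department": "Engineering", "usage": 250},
    {"department": "Marketing", "usage": 75},
]

# Segment the records by a string field, then compute a metric per bucket.
buckets = defaultdict(lambda: {"doc_count": 0, "total_usage": 0})
for record in records:
    bucket = buckets[record["department"]]
    bucket["doc_count"] += 1
    bucket["total_usage"] += record["usage"]

for key, bucket in sorted(buckets.items()):
    print(key, bucket)

# A request body expressing the same idea: a terms bucket aggregation
# with a sum metric aggregation nested inside it.
terms_request = {
    "aggs": {
        "by_department": {
            "terms": {"field": "department"},
            "aggs": {"total_usage": {"sum": {"field": "usage"}}},
        }
    },
    "size": 0,
}
```

Each bucket carries a document count, and any metric nested inside the bucket aggregation is computed once per bucket rather than once for the whole result set.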

We will cover the following topics, keeping the network traffic data example at the center:

  • Bucketing on string data
  • Bucketing on numeric data
  • Aggregating filtered data
  • Nesting aggregations
  • Bucketing on custom conditions
  • Bucketing on date/time data
  • Bucketing on geo-spatial data

Bucketing on string data

Sometimes, we may need to bucket or segment the data based on a field that has a string datatype, typically a keyword-typed field in Elasticsearch. This is very common. Some examples of scenarios in which you may want to segment the data by a string-typed field are:

  • Segmenting the network traffic data per department
  • Segmenting the network traffic data...

Pipeline aggregations


Pipeline aggregations, as their name suggests, allow you to aggregate over the result of another aggregation. They let you pipe the result of an aggregation as an input to another aggregation. Pipeline aggregations are a relatively new feature and they are still experimental. At a high level, there are two types of pipeline aggregation:

  • Parent pipeline aggregations have the pipeline aggregation nested inside other aggregations
  • Sibling pipeline aggregations have the pipeline aggregation as the sibling of the original aggregation from which pipelining is done

Let us understand how pipeline aggregations work by taking the example of the cumulative sum aggregation, which is a parent pipeline aggregation.
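As a small sketch of what a cumulative sum does, the Python snippet below computes a running total over per-bucket values and then shows one possible request body. The hourly values, the field names time and usage, and the aggregation names are assumptions for illustration; the interval parameter follows the Elasticsearch 6.x form of the date histogram aggregation.

```python
from itertools import accumulate

# Hypothetical hourly usage totals, as a date histogram with a nested sum might return.
hourly_usage = [30, 45, 10, 60]

# A cumulative sum adds each bucket's value to the running total of earlier buckets.
cumulative_usage = list(accumulate(hourly_usage))
print(cumulative_usage)

# Request body sketch: the cumulative_sum aggregation sits inside the date
# histogram, next to the sum it reads from through buckets_path.
cumulative_request = {
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "time", "interval": "1h"},
            "aggs": {
                "hourly_usage": {"sum": {"field": "usage"}},
                "cumulative_hourly_usage": {
                    "cumulative_sum": {"buckets_path": "hourly_usage"}
                },
            },
        }
    },
    "size": 0,
}
```

Because the cumulative sum is nested inside the date histogram whose buckets it accumulates over, it is a parent pipeline aggregation; buckets_path names the metric whose per-bucket values feed the running total.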

Calculating the cumulative sum of usage over time

While understanding the Date Histogram aggregation, in the section Focusing on a specific day and changing intervals, we looked at the aggregation to compute hourly bandwidth usage for one particular day. After completing...

Summary


In this chapter, we have learnt how to use Elasticsearch to build powerful analytics applications. We have covered how to slice and dice the data to get powerful insight. We started with metric aggregation to deal with numeric datatypes. We then covered bucket aggregation to find out how to slice the data into buckets or segments in order to drill down into specific segments.

We also understood how pipeline aggregations work. We did all of this while dealing with a real-world-like dataset of network traffic data. We have seen how flexible Elasticsearch is as an analytics engine. Without much additional data modelling and extra effort, we can analyze any field, even when the data is on a big data scale. This is a rare capability not offered by many data stores. As you will see in Chapter 7, Visualizing Data with Kibana, Kibana leverages many of the aggregations that we learnt about in this chapter.

This concludes the chapters on Elasticsearch, the core of Elastic Stack. We have a very...
