
You're reading from Implementing Splunk: Big Data Reporting and Development for Operational Intelligence

Product type: Book
Published in: Jan 2013
Publisher: Packt
ISBN-13: 9781849693288
Edition: 1st Edition
Tools: Splunk
Concepts: Big Data
Author: Vincent Bumgarner

Chapter 9. Summary Indexes and CSV Files

As the number of events retrieved by a query increases, performance decreases linearly. Summary indexing allows you to calculate statistics in advance, then run reports against these "roll ups", dramatically increasing performance.

Understanding summary indexes

A summary index is a place to store events calculated by Splunk. Usually, these events are aggregates of raw events broken up over time, for instance, how many errors occurred per hour. By calculating this information on an hourly basis, it is cheap and fast to run a query over a longer period of time, for instance, days, weeks, or months.

A summary index is usually populated from a saved search with Summary indexing enabled as an action. This is not the only way, but is certainly the most common.

On disk, a summary index is identical to any other Splunk index. The difference is solely the source of data. We create the index through configuration or through the GUI like any other index, and we manage the index size in the same way.

Note

Think of an index like a table, or possibly a tablespace in a typical SQL database. Indexes are capped by size and/or time, much like a tablespace, but all the data is stored together, much like a table. We will discuss index management...

When to use a summary index

When the question you want to answer requires looking at all or most events for a given source type, the number of events can quickly become huge. This is generally referred to as a "dense search".

For example, if you want to know how many page views happened on your website, the query to answer this question must inspect every event. Since each query uses a processor, we are essentially timing how fast our disk can retrieve the raw data and how fast a single processor can decompress that data. Doing a little math:

1,000,000 hits per day / 10,000 events processed per second = 100 seconds

If we use multiple indexers, or possibly buy much faster disks, we can cut this time, but only linearly. For instance, if the data is evenly split across four indexers, without changing disks, this query will take roughly 25 seconds.
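The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate, not a model of how Splunk schedules work: the function name and the assumption that events are split evenly across indexers running at the same fixed rate are illustrative.

```python
def dense_search_seconds(events, events_per_second_per_indexer, indexers=1):
    """Estimate wall-clock time for a search that must read every event.

    Assumes events are spread evenly across indexers and that each
    indexer processes its share at the same fixed rate.
    """
    return events / (events_per_second_per_indexer * indexers)


one_indexer = dense_search_seconds(1_000_000, 10_000)
four_indexers = dense_search_seconds(1_000_000, 10_000, indexers=4)
print(one_indexer, four_indexers)
```

Doubling the number of indexers only halves the estimate, which is the linear improvement described above.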

If we use summary indexing, we should be able to improve our times dramatically. Let's assume we have calculated hit counts per five...

When to not use a summary index

There are several cases where summary indexes are either inappropriate or inefficient. Consider the following:

When you need to see the original events: In most cases, summary indexes are used to store aggregate values. A summary index could be used to store a separate copy of events, but this is not usually the case. The more events you have in your summary index, the less advantage it has over the original index.
When the possible number of categories of data is huge: For example, if you want to know the top IP addresses seen per day, it may be tempting to simply capture a count of every IP address seen. This can still be a huge amount of data, and may not save you a lot of search time, if any. Likewise, simply storing the top 10 addresses per slice of time may not give an accurate picture over a long period of time. We will discuss this scenario under the Calculating top for a large time frame section.
When it is impractical to slice the data across sufficient...

Populating summary indexes with saved searches

A search to populate a summary index is much like any other saved search (see Chapter 2, Understanding Search, for more detail on creating saved searches). The differences are that this search will run periodically and the results will be stored in the summary index. Let's build our first summary search by following these steps:

Start with a search that produces some statistic:
```
source="impl_splunk_gen" | stats count by user
```
Save this search as summary - count by user.
Edit the search in Manager by navigating to Manager | Searches and reports | summary - count by user. The last dialog of the Save search... wizard also provides a link to Manager.
Set the appropriate times. This is a somewhat complicated discussion. See the How latency affects summary queries section later in this chapter.

Let's look at the following fields:

Search: source="impl_splunk_gen" | stats count by user
This is our query. Later we will use sistats, a special summary index...

Using summary index events in a query

After the query to populate the summary index has run for some time, we can use the results in other queries.

If you're in a hurry, or need to report against slices of time before the query was created, you will need to "backfill" your summary index. See the How and when to backfill summary data section for details about calculating summary values for past events.

First, let's look at what actually goes into the summary index:

08/15/2012 10:00:00, search_name="summary - count by user", search_now=1345046520.000, info_min_time=1345042800.000, info_max_time=1345046400.000, info_search_time=1345050512.340, count=17, user=mary

Breaking this event down, we have:

08/15/2012 10:00:00: This is the time at the beginning of this block of data. This is consistent with how timechart and bucket work.
search_name="summary - count by user": This is the name of the search. This is usually the easiest way to find the results you are interested in.
search_now ... info_search_time...
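To make the layout of the sample event concrete, the following Python sketch splits it into fields. It is not how Splunk extracts fields; the regular expression is an assumption that happens to fit this one event. It also assumes that info_min_time and info_max_time bound the time range that the populating search covered.

```python
import re

event = (
    '08/15/2012 10:00:00, search_name="summary - count by user", '
    "search_now=1345046520.000, info_min_time=1345042800.000, "
    "info_max_time=1345046400.000, info_search_time=1345050512.340, "
    "count=17, user=mary"
)

# Match key=value pairs; values are either double-quoted or run up to
# the next comma or whitespace.
pairs = re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', event)
fields = {key: value.strip('"') for key, value in pairs}

# Width of the summarized slice, in seconds.
window = float(fields["info_max_time"]) - float(fields["info_min_time"])
print(fields["search_name"], fields["user"], fields["count"], window)
```

The window works out to 3600 seconds, so this event summarizes one hour of raw events for the user mary.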

Using sistats, sitop, and sitimechart

So far we have used the stats command to populate our summary index. While this works perfectly well, the si* variants have a couple of advantages:

The remaining portion of the query does not have to be rewritten. For instance, stats count still works as if you were counting the raw events.
stats functions that require more data than what happened in that slice of time will still work. For example, if your time slices each represent an hour, it is not possible to calculate the average value for a day using nothing but the average of each hour. sistats keeps enough information to make this work.
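The point about averages can be seen with a small Python example. The numbers are hypothetical, and keeping a sum and a count per slice is just one illustrative way to preserve enough information; it is not a description of what sistats stores internally.

```python
# Hypothetical response times (ms), grouped by hour; the hours hold
# different numbers of events on purpose.
hours = {
    "10:00": [100, 100, 100, 100],
    "11:00": [400],
}

# Naive roll-up: keep only each hour's average, then average those.
hourly_avgs = [sum(v) / len(v) for v in hours.values()]
avg_of_avgs = sum(hourly_avgs) / len(hourly_avgs)

# Roll-up that keeps sum and count per hour, so the true average survives.
summaries = [(sum(v), len(v)) for v in hours.values()]
total = sum(s for s, _ in summaries)
count = sum(c for _, c in summaries)
true_avg = total / count

print(avg_of_avgs, true_avg)
```

The average of the hourly averages is 250, while the true average over all five events is 160. The two agree only when every slice holds the same number of events.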

There are a few fairly serious disadvantages to be aware of:

The query using the summary index must use a subset of the functions and split fields that were in the original populating query. If the subsequent query strays from what is in the original sistats data, the results may be unexpected and difficult to debug. For example:
- The following code works fine:
```
source...
```

How latency affects summary queries

Latency is the difference between the time assigned to an event (usually parsed from the text) and the time it was written to the index. Both times are captured, in _time and _indextime, respectively.

This query will show us what our latency is:

sourcetype=impl_splunk_gen
  | eval latency = _indextime - _time
  | stats min(latency) avg(latency) max(latency)

In my case, these statistics look as shown in the following screenshot:

The latency in this case is exaggerated, because the script behind impl_splunk_gen is creating events in chunks. In most production Splunk instances, the latency is usually just a few seconds. If there is any slowdown, perhaps because of network issues, the latency may increase dramatically, and so it should be accounted for.
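For readers who want to see the calculation outside of Splunk, the same statistics can be sketched in Python. The timestamp pairs below are made up for illustration and stand in for the _time and _indextime fields.

```python
# Hypothetical (event_time, index_time) pairs in epoch seconds,
# standing in for the _time and _indextime fields.
events = [
    (1345042800, 1345042803),
    (1345042860, 1345042890),
    (1345042920, 1345042921),
]

# Latency is index time minus event time, as in the eval above.
latencies = [index_time - event_time for event_time, index_time in events]

stats = {
    "min": min(latencies),
    "avg": sum(latencies) / len(latencies),
    "max": max(latencies),
}
print(stats)
```

A single slow event (30 seconds here) dominates the maximum, which is why the worst case, rather than the average, is the figure to plan around when scheduling summary searches.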

This query will produce a table showing the time for every event:

sourcetype=impl_splunk_gen
  | eval latency = _indextime - _time
  | eval time=strftime(_time,"%Y-%m-%d %H:%M:%S.%3N")
  | eval indextime=strftime...

How and when to backfill summary data

If you are building reports against summary data, you of course need enough time represented in your summary index. If your report represents only a day or two, then you can probably just wait for the summary to have enough information. If you need the report to work sooner rather than later, or the time frame is longer, then you can backfill your summary index.

Using fill_summary_index.py to backfill

The fill_summary_index.py script allows you to backfill the summary index for any time period you like. It does this by running the saved searches you have defined to populate your summary indexes, but for the time periods you specify.
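Conceptually, backfilling means re-running the populating search once per time slice in the period of interest. The Python sketch below only enumerates such slices; it is not the implementation of fill_summary_index.py, and the one-hour schedule and the function name hourly_windows are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def hourly_windows(start, end):
    """Yield (earliest, latest) pairs covering [start, end) in one-hour steps.

    Both bounds are assumed to fall exactly on the hour.
    """
    step = timedelta(hours=1)
    earliest = start
    while earliest < end:
        yield earliest, earliest + step
        earliest += step


windows = list(hourly_windows(datetime(2012, 8, 15, 8), datetime(2012, 8, 15, 11)))
for earliest, latest in windows:
    print(earliest.isoformat(), "->", latest.isoformat())
```

Each pair corresponds to one run of the saved search, with the pair supplying the earliest and latest times for that run.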

To use the script, follow the given procedure:

Create your scheduled search, as detailed previously in the Populating summary indexes with saved searches section.
Log in to the shell on your Splunk instance. If you are running a distributed environment, log in to the search head.
Change directories to the Splunk bin directory....

Reducing summary index size

If the saved search populating a summary index produces too many results, the summary index is less effective at speeding up searches. This usually occurs because one or more of the fields used for grouping has more unique values than is expected.

One common example of a field that can have many unique values is the URL in a web access log. The number of URL values might increase in instances where:

The URL contains a session ID
The URL contains search terms
Hackers are throwing URLs at your site trying to break in
Your security team runs tools looking for vulnerabilities

On top of this, multiple URLs can represent exactly the same resource, as follows:

/home/index.html
/home/
/home/index.html?a=b
/home/?a=b

We will cover a few approaches to flatten these values. These are just examples and ideas, as your particular case may require a different approach.
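As a language-neutral illustration of what flattening means, here is a minimal Python sketch that maps the four URLs above to a single grouping key. The two rules it applies (drop the query string, treat a trailing index.html as the directory) are hypothetical and chosen only to fit this example; inside Splunk the equivalent work is done with search commands.

```python
from urllib.parse import urlsplit


def flatten_url(url):
    """Reduce a URL to a coarse grouping key.

    Hypothetical rules: drop the query string, and treat a trailing
    index.html as the directory itself.
    """
    path = urlsplit(url).path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return path


urls = ["/home/index.html", "/home/", "/home/index.html?a=b", "/home/?a=b"]
keys = {flatten_url(u) for u in urls}
print(keys)
```

All four inputs collapse to /home/, so a summary grouped by this key stores one row per time slice instead of four.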

Using eval and rex to define grouping fields

One way to tackle this problem is to make up a new field from the URL...

Calculating top for a large time frame

One common problem is to find the top contributors out of some huge set of unique values. For instance, if you want to know what IP addresses are using the most bandwidth in a given day or week, you may have to keep track of the total of request sizes across millions of unique hosts to definitively answer this question. When using summary indexes, this means storing millions of events in the summary index, quickly defeating the point of summary indexes.

Just to illustrate, let's look at a simple set of data:

Time	1.1.1.1	2.2.2.2	3.3.3.3	4.4.4.4	5.5.5.5	6.6.6.6
12:00	99	100	100	100
13:00	99		100	100	100
14:00	99	100		101	100
15:00	99		99	100	100
16:00	99	100			100	100
total	495	300	299	401	400	100
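To see what is lost when only the largest values per hour are kept, the table can be replayed in Python. This is a sketch under one stated assumption: when several addresses tie within an hour, the tie is broken by the order in which the addresses are listed, which is an arbitrary choice.

```python
from collections import Counter

# Values per IP per hour, copied from the table above.
hourly = {
    "12:00": {"1.1.1.1": 99, "2.2.2.2": 100, "3.3.3.3": 100, "4.4.4.4": 100},
    "13:00": {"1.1.1.1": 99, "3.3.3.3": 100, "4.4.4.4": 100, "5.5.5.5": 100},
    "14:00": {"1.1.1.1": 99, "2.2.2.2": 100, "4.4.4.4": 101, "5.5.5.5": 100},
    "15:00": {"1.1.1.1": 99, "3.3.3.3": 99, "4.4.4.4": 100, "5.5.5.5": 100},
    "16:00": {"1.1.1.1": 99, "2.2.2.2": 100, "5.5.5.5": 100, "6.6.6.6": 100},
}

true_totals = Counter()
top3_totals = Counter()
for counts in hourly.values():
    true_totals.update(counts)
    # Keep only the three largest values in this hour. sorted() is stable,
    # so ties are broken by the order the IPs are listed above.
    top3 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
    top3_totals.update(dict(top3))

print(true_totals.most_common(1))
print(top3_totals.most_common(1))
```

Over the full data, 1.1.1.1 is the largest contributor with 495. In the top-three roll-up it is credited with at most 99, and 4.4.4.4 appears to lead with 401. A steady contributor that narrowly misses the cut each hour is undercounted in exactly this way.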

If we only stored the top three IPs per hour, our data set would look like the following:

Storing raw events in a summary index

Sometimes it is desirable to copy events to another index. I have seen a couple of reasons for doing this, namely:

Differing retention: If some special events need to be kept indefinitely, but the index where they are initially captured rolls off after some period of time, they can be captured into a summary index.
Enrichment: Sometimes the enrichment of data is too expensive to happen with every query, or it is important to capture events with the values from a lookup as the values existed at a particular point in time

The process is essentially the same as creating any summary index events. Follow these steps:

Create a populating query.
Add interesting fields using the fields command.
Add a search_name field to the search definition.
Include _time, but rename _raw to raw.

Let's capture all errors that mary sees, enriched with some extra data. First, create the query:

sourcetype=impl_splunk_gen mary error
| eval raw=_raw
| table _time raw department city

Save...

Using CSV files to store transient data

Sometimes it is useful to store small amounts of data outside of a Splunk index. Using the inputcsv and outputcsv commands, we can store tabular data in CSV files on the filesystem.

Pre-populating a dropdown

If a dashboard contains a dynamic dropdown, you must use a search to populate the dropdown. As the amount of data increases, the query to populate the dropdown will run more and more slowly, even from a summary index. We can use a CSV file to store just the information needed, simply adding new values when they occur.

First, we build a query to generate the CSV file. This query should be run over as much data as possible:

source="impl_splunk_gen"
  | stats count by user
  | outputcsv user_list.csv

Next, we need a query that appends any new entries to the file. Schedule it to run periodically as a saved search:

source="impl_splunk_gen"
  | stats count by user
  | append [inputcsv user_list.csv]
  | stats sum(count) as count...
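The merge that this query performs can be sketched in Python. The query above is cut off, so the sketch assumes that the counts are summed per user and that the result is written back as a table with user and count columns. The sample rows are invented, and in-memory strings stand in for user_list.csv.

```python
import csv
import io
from collections import Counter

# Stand-in for the existing user_list.csv (hypothetical contents).
existing_csv = "user,count\nmary,10\nbob,4\n"

# Stand-in for the latest "stats count by user" results (hypothetical).
new_counts = {"mary": 2, "jacky": 1}

# Start from the new results, then add the counts already on file.
merged = Counter(new_counts)
for row in csv.DictReader(io.StringIO(existing_csv)):
    merged[row["user"]] += int(row["count"])

# Write the merged table back out, one row per user.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["user", "count"])
for user in sorted(merged):
    writer.writerow([user, merged[user]])

print(out.getvalue())
```

Existing users keep a running total, and a user seen for the first time (jacky here) simply becomes a new row, so the file grows only when a new value appears.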

Summary

In this chapter, we have explored the use of summary indexes and the commands surrounding them. While summary indexes are not always the answer, they can be very useful for particular problems. We also explored alternative approaches using CSV files for interim storage.

Summary indexes have long been a hotbed of development at Splunk, and I know there has been major work done for Splunk 5, increasing the speed of some summary queries dramatically.

In our next chapter we will dive into the configuration files that drive Splunk.


Author (1)

VINCENT BUMGARNER

Vincent Bumgarner has been designing software for over 20 years, working with many languages on nearly as many platforms. He started using Splunk in 2007 and has enjoyed watching the product evolve over the years. While working for Splunk, he has helped many companies train dozens of users to drive, extend, and administer this extremely flexible product. At least one person in every company he has worked with has asked for a book, and he hopes that this book will help fill their shelves.
Time	2.2.2.2	3.3.3.3	4.4.4.4	5.5.5.5
12:00	100	100	100
13:00		100	100	100
14:00	100		101...