Chapter 9. Summary Indexes and CSV Files

As the number of events retrieved by a query increases, the time the query takes grows roughly linearly. Summary indexing allows you to calculate statistics in advance and then run reports against these "rollups", dramatically improving performance.

Understanding summary indexes


A summary index is a place to store events calculated by Splunk. Usually, these events are aggregates of raw events broken up over time, for instance, how many errors occurred per hour. By calculating this information on an hourly basis, it is cheap and fast to run a query over a longer period of time, for instance, days, weeks, or months.
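For instance, a populating search for an hourly error count could be as simple as the following sketch; note that the hourly granularity comes from the saved search's schedule and time range, not from the query itself:

source="impl_splunk_gen" error
  | stats count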

A summary index is usually populated from a saved search with Summary indexing enabled as an action. This is not the only way, but is certainly the most common.

On disk, a summary index is identical to any other Splunk index. The difference is solely the source of data. We create the index through configuration or through the GUI like any other index, and we manage the index size in the same way.

Note

Think of an index like a table, or possibly a tablespace in a typical SQL database. Indexes are capped by size and/or time, much like a tablespace, but all the data is stored together, much like a table. We will discuss index management...

When to use a summary index


When the question you want to answer requires looking at all or most events for a given source type, the number of events to inspect can very quickly become huge. This kind of query is generally referred to as a "dense search".

For example, if you want to know how many page views happened on your website, the query to answer this question must inspect every event. Since each query uses a processor, we are essentially timing how fast our disk can retrieve the raw data and how fast a single processor can decompress that data. Doing a little math:

1,000,000 hits per day / 10,000 events processed per second = 100 seconds

If we use multiple indexers, or possibly buy much faster disks, we can cut this time, but only linearly. For instance, if the data is evenly split across four indexers, without changing disks, this query will take roughly 25 seconds.
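Applying the same arithmetic to the four-indexer case:

1,000,000 hits per day / 4 indexers / 10,000 events per second per indexer = 25 seconds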

If we use summary indexing, we should be able to improve our times dramatically. Let's assume we have calculated hit counts per five...

When to not use a summary index


There are several cases where summary indexes are either inappropriate or inefficient. Consider the following:

  • When you need to see the original events: In most cases, summary indexes are used to store aggregate values. A summary index could be used to store a separate copy of events, but this is not usually the case. The more events you have in your summary index, the less advantage it has over the original index.

  • When the possible number of categories of data is huge: For example, if you want to know the top IP addresses seen per day, it may be tempting to simply capture a count of every IP address seen. This can still be a huge amount of data, and may not save you a lot of search time, if any. Likewise, simply storing the top 10 addresses per slice of time may not give an accurate picture over a long period of time. We will discuss this scenario under the Calculating top for a large time frame section.

  • When it is impractical to slice the data across sufficient...

Populating summary indexes with saved searches


A search to populate a summary index is much like any other saved search (see Chapter 2, Understanding Search, for more detail on creating saved searches). The differences are that this search will run periodically and that its results will be stored in the summary index. Let's build our first summary search by following these steps:

  1. Start with a search that produces some statistic:

    source="impl_splunk_gen" | stats count by user
  2. Save this search as summary - count by user.

  3. Edit the search in Manager by navigating to Manager | Searches and reports | summary - count by user. The last dialog of the Save search... wizard provides a link to this manager page.

  4. Set the appropriate times. This is a somewhat complicated topic; see the How latency affects summary queries section later in this chapter.

Let's look at the following fields:

  • Search: source="impl_splunk_gen" | stats count by user

    This is our query. Later we will use sistats, a special summary index...

Using summary index events in a query


After the query to populate the summary index has run for some time, we can use the results in other queries.

If you're in a hurry, or need to report against slices of time before the query was created, you will need to "backfill" your summary index. See the How and when to backfill summary data section for details about calculating summary values for past events.

First, let's look at what actually goes into the summary index:

08/15/2012 10:00:00, search_name="summary - count by user", search_now=1345046520.000, info_min_time=1345042800.000, info_max_time=1345046400.000, info_search_time=1345050512.340, count=17, user=mary

Breaking this event down, we have:

  • 08/15/2012 10:00:00: This is the time at the beginning of this block of data. This is consistent with how timechart and bucket work.

  • search_name="summary - count by user": This is the name of the search. This is usually the easiest way to find the results you are interested in.

  • search_now ... info_search_time...
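With these events in place, a report against the summary might look like the following sketch; the default summary index is named summary, and search_name lets us select only the results of our saved search:

index=summary search_name="summary - count by user"
  | stats sum(count) as count by user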

Using sistats, sitop, and sitimechart


So far we have used the stats command to populate our summary index. While this works perfectly well, the si* variants have a couple of advantages:

  • The remaining portion of the query does not have to be rewritten. For instance, stats count still works as if you were counting the raw events.

  • stats functions that require more data than what happened in that slice of time will still work. For example, if your time slices each represent an hour, it is not possible to calculate the average value for a day using nothing but the average of each hour. sistats keeps enough information to make this work.
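To see why this matters, imagine one hour containing a single event with a value of 100 and the next hour containing 99 events each with a value of 1. The true average across both hours is (100 + 99) / 100 = 1.99, but the average of the two hourly averages is (100 + 1) / 2 = 50.5. The si* variants store enough of the underlying information, such as sums and counts, for the later query to compute the true value.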

There are a few fairly serious disadvantages to be aware of:

  • The query using the summary index must use a subset of the functions and split fields that were in the original populating query. If the subsequent query strays from what is in the original sistats data, the results may be unexpected and difficult to debug. For example:

    • The following code works fine:

      source...
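The example above is truncated, but a minimal working pair might look like the following sketch, with the saved search name being an assumption. The populating search:

source="impl_splunk_gen"
  | sistats count by user

And the consuming query, which uses the original stats syntax exactly as if it were running against raw events:

index=summary search_name="summary - sistats count by user"
  | stats count by user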

How latency affects summary queries


Latency is the difference between the time assigned to an event (usually parsed from the text) and the time it was written to the index. Both times are captured, in _time and _indextime, respectively.

This query will show us what our latency is:

sourcetype=impl_splunk_gen
  | eval latency = _indextime - _time
  | stats min(latency) avg(latency) max(latency)

In my case, these statistics look as shown in the following screenshot:

The latency in this case is exaggerated, because the script behind impl_splunk_gen is creating events in chunks. In most production Splunk instances, the latency is usually just a few seconds. If there is any slowdown, perhaps because of network issues, the latency may increase dramatically, and so it should be accounted for.

This query will produce a table showing the time for every event:

sourcetype=impl_splunk_gen
  | eval latency = _indextime - _time
  | eval time=strftime(_time,"%Y-%m-%d %H:%M:%S.%3N")
  | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S.%3N")
  | table time indextime latency

How and when to backfill summary data


If you are building reports against summary data, you of course need enough time represented in your summary index. If your report represents only a day or two, then you can probably just wait for the summary to have enough information. If you need the report to work sooner rather than later, or the time frame is longer, then you can backfill your summary index.

Using fill_summary_index.py to backfill

The fill_summary_index.py script allows you to backfill the summary index for any time period you like. It does this by running the saved searches you have defined to populate your summary indexes, but for the time periods you specify.

To use the script, follow the given procedure:

  1. Create your scheduled search, as detailed previously in the Populating summary indexes with saved searches section.

  2. Log in to the shell on your Splunk instance. If you are running a distributed environment, log in to the search head.

  3. Change directories to the Splunk bin directory....
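For reference, a full invocation might look something like the following sketch; the app name, time range, and credentials are placeholders, and the script's own usage output lists the exact options for your version:

./splunk cmd python fill_summary_index.py -app is_app_one -name "summary - count by user" -et -30d -lt now -j 8 -dedup true -auth admin:changeme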

Reducing summary index size


If the saved search populating a summary index produces too many results, the summary index is less effective at speeding up searches. This usually occurs because one or more of the fields used for grouping has more unique values than is expected.

One common example of a field that can have many unique values is the URL in a web access log. The number of URL values might increase in instances where:

  • The URL contains a session ID

  • The URL contains search terms

  • Hackers are throwing URLs at your site trying to break in

  • Your security team runs tools looking for vulnerabilities

On top of this, multiple URLs can represent exactly the same resource, as follows:

  • /home/index.html

  • /home/

  • /home/index.html?a=b

  • /home/?a=b

We will cover a few approaches to flatten these values. These are just examples and ideas, as your particular case may require a different approach.

Using eval and rex to define grouping fields

One way to tackle this problem is to make up a new field from the URL...
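As an illustrative sketch, assuming an access log whose events carry a url field (both the source type and the field names here are assumptions), you could strip everything after the question mark and group on the remaining path:

sourcetype=impl_splunk_web
  | rex field=url "(?<url_path>[^?]+)"
  | stats count by url_path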

Calculating top for a large time frame


One common problem is to find the top contributors out of some huge set of unique values. For instance, if you want to know what IP addresses are using the most bandwidth in a given day or week, you may have to keep track of the total of request sizes across millions of unique hosts to definitively answer this question. When using summary indexes, this means storing millions of events in the summary index, quickly defeating the point of summary indexes.

Just to illustrate, let's look at a simple set of data:

Time    1.1.1.1   2.2.2.2   3.3.3.3   4.4.4.4   5.5.5.5   6.6.6.6
12:00   99        100       100       100
13:00   99                  100       100       100
14:00   99        100                 101       100
15:00   99                  99        100       100
16:00   99        100                           100       100
total   495       300       299       401       400       100

If we only stored the top three IPs per hour, our data set would look like the following:

Time    1.1.1.1   2.2.2.2   3.3.3.3   4.4.4.4   5.5.5.5   6.6.6.6
12:00             100       100       100
13:00                       100       100       100
14:00             100                 101       ...
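Notice what happens to 1.1.1.1 in this view: although it has the highest overall total (495), its steady 99 requests per hour never makes the top three in any single hour, so it would disappear from the summary entirely. This is exactly the kind of inaccuracy to watch for when storing only the top N per slice of time.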

Storing raw events in a summary index


Sometimes it is desirable to copy events to another index. I have seen a couple of reasons for doing this, namely:

  • Differing retention: If some special events need to be kept indefinitely, but the index where they are initially captured rolls off after some period of time, they can be captured into a summary index

  • Enrichment: Sometimes the enrichment of data is too expensive to happen with every query, or it is important to capture events with the values from a lookup as the values existed at a particular point in time

The process is essentially the same as creating any summary index events. Follow these steps:

  1. Create a populating query.

  2. Add interesting fields using the fields command.

  3. Add a search_name field to the search definition.

  4. Include _time, but rename _raw to raw.

Let's capture all errors that mary sees, enriched with some extra data. First, create the query:

sourcetype=impl_splunk_gen mary error
| eval raw=_raw
| table _time raw department city

Save...
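Later, once copies have accumulated, they can be pulled back out and even re-expanded into _raw; a minimal sketch, assuming the default summary index and a saved search named summary - mary errors:

index=summary search_name="summary - mary errors"
  | rename raw as _raw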

Using CSV files to store transient data


Sometimes it is useful to store small amounts of data outside of a Splunk index. Using the inputcsv and outputcsv commands, we can store tabular data in CSV files on the filesystem.

Pre-populating a dropdown

If a dashboard contains a dynamic dropdown, you must use a search to populate the dropdown. As the amount of data increases, the query to populate the dropdown will run more and more slowly, even from a summary index. We can use a CSV file to store just the information needed, simply adding new values when they occur.

First, we build a query to generate the CSV file. This query should be run over as much data as possible:

source="impl_splunk_gen"
  | stats count by user
  | outputcsv user_list.csv

Next, we need a query that will append any new entries to the file. Schedule it to run periodically as a saved search:

source="impl_splunk_gen"
  | stats count by user
  | append [inputcsv user_list.csv]
  | stats sum(count) as count by user
  | outputcsv user_list.csv
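The dropdown's population query can then simply read this file, which stays fast no matter how much raw data has been indexed (a minimal sketch):

| inputcsv user_list.csv
  | fields user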

Summary


In this chapter, we have explored the use of summary indexes and the commands surrounding them. While summary indexes are not always the answer, they can be very useful for particular problems. We also explored alternative approaches using CSV files for interim storage.

Summary indexes have long been a hotbed of development at Splunk, and I know there has been major work done for Splunk 5, increasing the speed of some summary queries dramatically.

In the next chapter, we will dive into the configuration files that drive Splunk.
