A summary index is a place to store events calculated by Splunk. Usually, these events are aggregates of raw events broken up over time, for instance, how many errors occurred per hour. By calculating this information on an hourly basis, it is cheap and fast to run a query over a longer period of time, for instance, days, weeks, or months.
A summary index is usually populated from a saved search with Summary indexing enabled as an action. This is not the only way, but it is certainly the most common.
On disk, a summary index is identical to any other Splunk index. The difference is solely the source of data. We create the index through configuration or through the GUI like any other index, and we manage the index size in the same way.
When the question you want to answer requires looking at all or most events for a given source type, the number of events can quickly become huge. This is what is generally referred to as a "dense search".
For example, if you want to know how many page views happened on your website, the query to answer this question must inspect every event. Since each query uses a processor, we are essentially timing how fast our disk can retrieve the raw data and how fast a single processor can decompress that data. Doing a little math:
1,000,000 hits per day / 10,000 events processed per second = 100 seconds
If we use multiple indexers, or possibly buy much faster disks, we can cut this time, but only linearly. For instance, if the data is evenly split across four indexers, without changing disks, this query will take roughly 25 seconds.
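The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name scan_seconds is made up for illustration, and it assumes the events are split evenly and every indexer processes its share at the same rate.

```python
def scan_seconds(events: int, events_per_second: int, indexers: int = 1) -> float:
    """Estimate how long a dense search takes when every event must be read.

    Assumes an even split of events across indexers and an identical
    per-indexer processing rate (both are simplifications).
    """
    return events / (events_per_second * indexers)


single = scan_seconds(1_000_000, 10_000)              # one indexer
spread = scan_seconds(1_000_000, 10_000, indexers=4)  # four indexers
print(single, spread)
```

Doubling the indexers only halves the estimate, which is the linear improvement described above.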
If we use summary indexing, we should be able to improve our times dramatically. Let's assume we have calculated hit counts per five...
There are several cases where summary indexes are either inappropriate or inefficient. Consider the following:
When you need to see the original events: In most cases, summary indexes are used to store aggregate values. A summary index could be used to store a separate copy of events, but this is not usually the case. The more events you have in your summary index, the less advantage it has over the original index.
When the possible number of categories of data is huge: For example, if you want to know the top IP addresses seen per day, it may be tempting to simply capture a count of every IP address seen. This can still be a huge amount of data, and may not save you a lot of search time, if any. Likewise, simply storing the top 10 addresses per slice of time may not give an accurate picture over a long period of time. We will discuss this scenario under the Calculating top for a large time frame section.
When it is impractical to slice the data across sufficient...
A search to populate a summary index is much like any other saved search (see Chapter 2, Understand Search, for more detail on creating saved searches). The differences are that this search will run periodically and the results will be stored in the summary index. Let's build our first summary search by following these steps:
Start with a search that produces some statistic:
source="impl_splunk_gen" | stats count by user
Save this search as summary - count by user.
Edit the search in Manager by navigating to Manager | Searches and reports | summary - count by user. The Save search... wizard provides a link to the manager on the last dialog in the wizard.
Set the appropriate times. This is a somewhat complicated discussion. See the How latency affects summary queries section discussed later.
Let's look at the following fields:
Search:
source="impl_splunk_gen" | stats count by user
This is our query. Later we will use sistats, a special summary index...
After the query to populate the summary index has run for some time, we can use the results in other queries.
If you're in a hurry, or need to report against slices of time before the query was created, you will need to "backfill" your summary index. See the How and when to backfill summary data section for details about calculating summary values for past events.
First, let's look at what actually goes into the summary index:
08/15/2012 10:00:00, search_name="summary - count by user", search_now=1345046520.000, info_min_time=1345042800.000, info_max_time=1345046400.000, info_search_time=1345050512.340, count=17, user=mary
Breaking this event down, we have:
08/15/2012 10:00:00: This is the time at the beginning of this block of data. This is consistent with how timechart and bucket work.
search_name="summary - count by user": This is the name of the search. This is usually the easiest way to find the results you are interested in.
search_now ... info_search_time...
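To make the layout of the summary event concrete, here is a minimal Python sketch that splits the sample event shown above into its timestamp and key=value fields. The regular expression is an assumption that fits this one sample, not a general Splunk parser. Treating the difference between info_max_time and info_min_time as the width of the time slice is likewise an assumption made for illustration.

```python
import re

event = (
    '08/15/2012 10:00:00, search_name="summary - count by user", '
    "search_now=1345046520.000, info_min_time=1345042800.000, "
    "info_max_time=1345046400.000, info_search_time=1345050512.340, "
    "count=17, user=mary"
)

# Everything before the first comma is the timestamp; the rest is key=value pairs.
timestamp, _, rest = event.partition(", ")

# A value is either a double-quoted string or a run of characters up to a comma or space.
fields = {
    key: value.strip('"')
    for key, value in re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', rest)
}

# Assumed meaning: the two info_* bounds delimit the slice this event summarizes.
slice_seconds = float(fields["info_max_time"]) - float(fields["info_min_time"])
print(timestamp, fields["search_name"], fields["user"], fields["count"], slice_seconds)
```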
So far we have used the stats command to populate our summary index. While this works perfectly well, the si* variants have a couple of advantages:
The remaining portion of the query does not have to be rewritten. For instance, stats count still works as if you were counting the raw events.
stats functions that require more data than what happened in that slice of time will still work. For example, if your time slices each represent an hour, it is not possible to calculate the average value for a day using nothing but the average of each hour. sistats keeps enough information to make this work.
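The point about averages can be illustrated with a small Python sketch. The response times below are hypothetical. Storing a sum and a count per slice is just one way to keep enough information; whether sistats uses this exact representation internally is an assumption, not something stated here.

```python
# Hypothetical response times (ms) for two one-hour slices of uneven size.
hours = {
    "10:00": [100, 100, 100, 100],  # busy hour
    "11:00": [500],                 # quiet hour
}

# Naive summary: keep only each hour's average, then average those.
hourly_avg = {hour: sum(values) / len(values) for hour, values in hours.items()}
avg_of_avgs = sum(hourly_avg.values()) / len(hourly_avg)

# Richer summary: keep the sum and the count for each hour.
hourly_parts = {hour: (sum(values), len(values)) for hour, values in hours.items()}
total = sum(part_sum for part_sum, _ in hourly_parts.values())
count = sum(part_count for _, part_count in hourly_parts.values())
true_avg = total / count

print(avg_of_avgs)  # 300.0
print(true_avg)     # 180.0
```

The average of the hourly averages gives every hour equal weight, so the single slow event in the quiet hour dominates. Keeping the sum and count lets the daily figure weight each event equally.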
There are a few fairly serious disadvantages to be aware of:
The query using the summary index must use a subset of the functions and split fields that were in the original populating query. If the subsequent query strays from what is in the original sistats data, the results may be unexpected and difficult to debug. For example:
The following code works fine:
source...
Latency is the difference between the time assigned to an event (usually parsed from the text) and the time it was written to the index. Both times are captured, in _time and _indextime, respectively.
This query will show us what our latency is:
sourcetype=impl_splunk_gen | eval latency = _indextime - _time | stats min(latency) avg(latency) max(latency)
In my case, these statistics look as shown in the following screenshot:
The latency in this case is exaggerated, because the script behind impl_splunk_gen is creating events in chunks. In most production Splunk instances, the latency is usually just a few seconds. If there is any slowdown, perhaps because of network issues, the latency may increase dramatically, and so it should be accounted for.
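The same latency calculation can be sketched outside Splunk. In this Python example the event and index times are hypothetical epoch seconds standing in for _time and _indextime; it mirrors the eval and stats steps of the query above rather than reproducing Splunk itself.

```python
from statistics import mean

# Hypothetical (event_time, index_time) pairs in epoch seconds,
# standing in for _time and _indextime.
events = [
    (1345046400, 1345046402),
    (1345046410, 1345046414),
    (1345046420, 1345046450),  # a delayed event, e.g. during a network slowdown
]

# latency = _indextime - _time, computed per event.
latencies = [index_time - event_time for event_time, index_time in events]

summary = {"min": min(latencies), "avg": mean(latencies), "max": max(latencies)}
print(summary)
```

A single delayed event pulls the maximum far above the typical value, which is why the maximum, not the average, is the figure to plan around.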
This query will produce a table showing the time for every event:
sourcetype=impl_splunk_gen | eval latency = _indextime - _time | eval time=strftime(_time,"%Y-%m-%d %H:%M:%S.%3N") | eval indextime=strftime...
If you are building reports against summary data, you of course need enough time represented in your summary index. If your report represents only a day or two, then you can probably just wait for the summary to have enough information. If you need the report to work sooner rather than later, or the time frame is longer, then you can backfill your summary index.
The fill_summary_index.py script allows you to backfill the summary index for any time period you like. It does this by running the saved searches you have defined to populate your summary indexes, but for the time periods you specify.
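Conceptually, backfilling means running the populating search once for each past slice of time. The Python sketch below only enumerates such slices. It does not call fill_summary_index.py or reproduce its internals, and the function name backfill_windows and the one-hour step are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def backfill_windows(start, end, step):
    """Yield (earliest, latest) pairs covering [start, end) in step-sized slices."""
    current = start
    while current < end:
        # Clamp the final slice so it never runs past the requested end.
        yield current, min(current + step, end)
        current += step


windows = list(
    backfill_windows(
        datetime(2012, 8, 15, 0),
        datetime(2012, 8, 15, 3),
        timedelta(hours=1),
    )
)
for earliest, latest in windows:
    print(earliest.isoformat(), "->", latest.isoformat())
```

Each pair would correspond to one run of the saved search over that window.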
To use the script, follow the given procedure:
Create your scheduled search, as detailed previously in the Populating summary indexes with saved searches section.
Log in to the shell on your Splunk instance. If you are running a distributed environment, log in to the search head.
Change directories to the Splunk bin directory....
If the saved search populating a summary index produces too many results, the summary index is less effective at speeding up searches. This usually occurs because one or more of the fields used for grouping has more unique values than is expected.
One common example of a field that can have many unique values is the URL in a web access log. The number of URL values might increase in instances where:
The URL contains a session ID
The URL contains search terms
Hackers are throwing URLs at your site trying to break in
Your security team runs tools looking for vulnerabilities
On top of this, multiple URLs can represent exactly the same resource, as follows:
/home/index.html
/home/
/home/index.html?a=b
/home/?a=b
We will cover a few approaches to flatten these values. These are just examples and ideas, as your particular case may require a different approach.
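As one illustration of the idea, the following Python sketch collapses the four URLs listed above into a single value. The rules it applies (drop the query string, treat index.html as the default document of its directory) are assumptions suited to this example; a real site may need different rules, and the function name flatten_url is hypothetical.

```python
from urllib.parse import urlsplit


def flatten_url(url: str) -> str:
    """Reduce a URL to a canonical path for grouping purposes."""
    path = urlsplit(url).path  # drops the query string and fragment
    if path.endswith("/index.html"):
        # Assumed rule: index.html is the directory's default document.
        path = path[: -len("index.html")]  # keep the trailing slash
    return path


urls = ["/home/index.html", "/home/", "/home/index.html?a=b", "/home/?a=b"]
flattened = {flatten_url(url) for url in urls}
print(flattened)  # {'/home/'}
```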
One common problem is to find the top contributors out of some huge set of unique values. For instance, if you want to know what IP addresses are using the most bandwidth in a given day or week, you may have to keep track of the total of request sizes across millions of unique hosts to definitively answer this question. When using summary indexes, this means storing millions of events in the summary index, quickly defeating the point of summary indexes.
Just to illustrate, let's look at a simple set of data:
Time | 1.1.1.1 | 2.2.2.2 | 3.3.3.3 | 4.4.4.4 | 5.5.5.5 | 6.6.6.6 |
---|---|---|---|---|---|---|
12:00 | 99 | 100 | 100 | 100 | | |
13:00 | 99 | 100 | 100 | 100 | | |
14:00 | 99 | 100 | 101 | 100 | | |
15:00 | 99 | 99 | 100 | 100 | | |
16:00 | 99 | 100 | 100 | 100 | | |
total | 495 | 300 | 299 | 401 | 400 | 100 |
If we only stored the top three IPs per hour, our data set would look like the following:
Time | 1.1.1.1 | 2.2.2.2 | 3.3.3.3 | 4.4.4.4 | 5.5.5.5 | 6.6.6.6 |
---|---|---|---|---|---|---|
12:00 | 100 | 100 | 100 | | | |
13:00 | 100 | 100 | 100 | | | |
14:00 | 100 | 101... |