Splunk 7 Essentials - Third Edition

Product type: Book
Published: March 2018
Publisher: Packt
ISBN-13: 9781788839112
Pages: 220
Edition: 3rd Edition
Authors: J-P Contreras, Steven Koelpin, Erickson Delgado, Betsy Page Sigman

Table of Contents (10 chapters)

  • Preface
  • Splunk – Getting Started
  • Bringing in Data
  • Search Processing Language
  • Reporting, Alerts, and Search Optimization
  • Dynamic Dashboarding
  • Data Models and Pivot
  • HTTP Event Collector
  • Best Practices and Advanced Queries
  • Taking Splunk to the Organization

Bringing in Data

Computerized systems are responsible for much of the data produced on a daily basis. Splunk Enterprise makes it easy to get data from many of these systems. This data is frequently referred to as machine data. And since machines mostly generate data in an ongoing, streaming fashion, Splunk is especially useful, as it handles streaming data easily and efficiently.

In addition to capturing machine data, Splunk Enterprise allows you, as the user, to enhance and enrich the data either as it is stored or as it is searched. Machine data can be enriched with business rules and logic for enhanced searching capabilities. Often it is combined with traditional row/column data, such as product hierarchies, to give the machine data business context.
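Enrichment of this kind is frequently done with SPL's lookup command. The following is only a sketch of the idea, not an example from this book: the index, sourcetype, lookup table, and field names are all hypothetical assumptions.

    index=sales sourcetype=transaction_log
    | lookup product_hierarchy product_id OUTPUT product_name, category
    | stats count by category

Here, each event's product_id is matched against a row/column lookup table (product_hierarchy) so that searches can report on business-level fields such as category.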

In this chapter, you will learn about Splunk and how it relates to an often-used term, big data, as well as the most...

Splunk and big data

Big data is a widely used term but, as is often the case, one that means different things to different people. In this part of the chapter, we present common characteristics of big data.

There is no doubt that there is a lot of data today. Increasingly, though, the term big data refers less to sheer volume than to other characteristics, such as variability so wide that legacy, conventional organizational data systems cannot consume the data or produce analytics from it.

Streaming data

Streaming data is almost always being generated, with a timestamp associated with each entry. Splunk's inherent ability to monitor and track data loaded from ever-growing log files, or accept data...

Splunk data sources

Splunk was invented as a way to keep track of and analyze machine data coming from a variety of computerized systems. It is a powerful platform for doing just that. But since its invention, it has been used for a myriad of different data types, including streaming log data, database and spreadsheet data, and data provided by web services. The various types of data that Splunk is often used for are explained in the next few sections.

Machine data

As mentioned previously, much of Splunk's data capability is focused on machine data. Machine data is data created each time a machine does something, even if it is as seemingly insignificant as a successful user login. Each event has information about its...

Creating indexes

Indexes are where Splunk Enterprise stores all the data it has processed. An index is essentially a collection of databases, located by default at $SPLUNK_HOME/var/lib/splunk. Before data can be searched, it needs to be indexed, a process we describe here.

Tip from the Fez: A variety of intricate settings can be manipulated to control the size and data-management aspects of an index. We will not cover them in this book; however, as your requirements grow more complex, be sure to consider the topics around index management, such as overall size, bucket parameters, archiving, and other optimization settings.

There are two ways to create an index: through the Splunk user interface or by creating an indexes.conf file. You will be shown here how to create an index using the Splunk portal, but you should realize that when you do that...
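Whichever route you take, the result is a stanza in indexes.conf. The following is a minimal sketch of such a stanza; the index name (winlogs) is a hypothetical assumption used for illustration, while homePath, coldPath, and thawedPath are the real settings discussed in the Buckets section that follows:

    # $SPLUNK_HOME/etc/apps/<app>/local/indexes.conf
    # 'winlogs' is a hypothetical index name used for illustration
    [winlogs]
    homePath   = $SPLUNK_DB/winlogs/db
    coldPath   = $SPLUNK_DB/winlogs/colddb
    thawedPath = $SPLUNK_DB/winlogs/thaweddb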

Buckets

You may have noticed that there is a certain pattern in this configuration file, in which folders are broken into three locations: coldPath, homePath, and thawedPath. This is a very important concept in Splunk. An index contains compressed raw data and associated index files, which are spread out into age-designated directories. Each age-designated directory is called a bucket.

A bucket moves through several stages as it ages. In general, as your data gets older (think colder) in the system, it is pushed to the next bucket. And, as you can see in the following list, the thawed bucket contains data that has been restored from an archive. Here is a breakdown of the buckets in relation to each other:

  • hot: This is newly indexed data and open for writing (hotPath)
  • warm: This is data rolled from the hot bucket with no active writing (warmPath)
  • cold: This is data rolled from...
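The transitions between these stages are governed by indexes.conf settings such as maxDataSize, maxWarmDBCount, frozenTimePeriodInSecs, and coldToFrozenDir. These are real setting names, but the values below are illustrative assumptions only, shown against the hypothetical winlogs index from earlier:

    [winlogs]
    # Roll hot buckets to warm when they reach an automatic size threshold
    maxDataSize = auto
    # Keep at most this many warm buckets before rolling the oldest to cold
    maxWarmDBCount = 300
    # Freeze (delete or archive) data older than ~1 year (value in seconds)
    frozenTimePeriodInSecs = 31536000
    # Optional: archive frozen buckets here instead of deleting them
    coldToFrozenDir = /opt/splunk_archive/winlogs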

Log files as data input

As mentioned earlier in this chapter, any configuration you make in the Splunk portal corresponds to a *.conf file written under the $SPLUNK_HOME directory. The same goes for the creation of data inputs: adding data inputs using the Splunk user interface creates entries in a file called inputs.conf (a sketch of such a file appears after the steps below).

For this exercise, use the windows_perfmon_logs.txt file provided in Chapter 2/samples.

Now that you have an index to store Windows logs, let's create a data input for it, with the following steps:

  1. Go to the Splunk home page.
  2. Click on your Destinations app. Make sure you are in the Destinations app before you execute the next steps, or your configuration changes won't be isolated to your application.
  3. In the Splunk navigation bar, select Settings.
  4. Under the Data section, click on Data inputs.
  5. On the Data inputs page, click on Files & directories.
  6. In...
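When these steps are complete, Splunk records the input in inputs.conf inside your app. The following is a minimal sketch of what such a monitor stanza typically looks like; the file path, index, and sourcetype values are hypothetical assumptions, not the book's exact configuration:

    # $SPLUNK_HOME/etc/apps/destinations/local/inputs.conf
    # Hypothetical monitor stanza; path, index, and sourcetype are assumptions
    [monitor:///var/log/windows_perfmon_logs.txt]
    index = winlogs
    sourcetype = windows_perfmon
    disabled = false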

Splunk events and fields

Throughout this chapter, you have been running Splunk search queries that return data. Before we go any further, it is important to understand what events and fields are, because they are essential to comprehending what happens when you run Splunk searches against your data.

In Splunk, data is classified into events. An event is like a record, such as a log file entry or another type of input data. An event can have many different attributes, or fields, or just a few. When you run a successful search query, you will see events returned from the Splunk indexes the search is run against. If you are looking at live streaming data, events can come into Splunk very quickly.

Every event is given a number of default fields. For a complete listing, go to http://docs.splunk.com/Documentation/Splunk/6.3.2/Data/Aboutdefaultfields. We will now go through...
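As a quick illustration, default fields such as index, host, source, sourcetype, and _time can be used directly in searches. The index, host, and sourcetype values in this sketch are hypothetical assumptions:

    index=winlogs host=WIN-SERVER01 sourcetype=windows_perfmon
    | stats count by source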

Extracting new fields

Most raw data that you will encounter will have some form of structure. As with a CSV (comma-separated values) file or a web log file, each entry in the log is assumed to follow some sort of format. Splunk makes custom field extraction very easy, especially for delimited files. Let's take the case of our Eventgen data and look at the following example. By design, the raw data generated by Eventgen is delimited by commas. The following is an example of a raw event:

2018-01-18 21:19:20:013632, 130.253.37.97,GET,/destination/PML/details,-,80,- 10.2.1.33,Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J3 Safari/6533.18.5,301,0,0,317,1514 

Since there is a distinct separation of fields in this data, we can use Splunk's field extraction capabilities to automatically classify...
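One way to pull named fields out of delimited text like this is SPL's rex command with named capture groups. This is only a sketch of the idea, not the book's exact extraction method; the index, sourcetype, and field names are hypothetical assumptions:

    index=main sourcetype=eventgen
    | rex field=_raw "^(?<event_time>[^,]+),\s*(?<client_ip>[^,]+),(?<http_method>[^,]+),(?<uri_path>[^,]+)"
    | stats count by http_method, uri_path

The regular expression names the first four comma-separated values of each raw event (the timestamp, client IP, HTTP method, and URI path in the sample above) so they can be used like any other field in the search.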

Summary

In this chapter, we began learning about big data and its related characteristics, such as streaming data, analytical data latency, and sparseness. We also covered the types of data that can be brought into Splunk. We then created an index and loaded a sample log file, all while examining the configuration file (.conf) entries made at the file system level. We talked about what fields and events are. And finally, we saw how to extract fields from events and name them so that they can be more useful to us.

In the chapters to come, we'll learn more about these important features of Splunk.
