This article being crafted by Ashish Kumar Yadav has been picked from Advanced Splunk book. This book helps you to get in touch with a great data science tool named Splunk. The big data world is an ever expanding forte and it is easy to get lost in the enormousness of machine data available at your bay. The Advanced Splunk book will definitely provide you with the necessary resources and the trail to get you at the other end of the machine data. While the book emphasizes on Splunk, it also discusses its close association with Python language and tools like R and Tableau that are needed for better analytics and visualization purpose.
(For more resources related to this topic, see here.)
Splunk supports numerous ways to ingest data on its server. Any data generated from a human-readable machine from various sources can be uploaded using data input methods such as files, directories, TCP/UDP scripts can be indexed on the Splunk Enterprise server and analytics and insights can be derived from them.
Uploading data on Splunk is one of the most important parts of analytics and visualizations of data. If data is not properly parsed, timestamped, or broken into events, then it can be difficult to analyze and get proper insight on the data. Splunk can be used to analyze and visualize data ranging from various domains, such as IT security, networking, mobile devices, telecom infrastructure, media and entertainment devices, storage devices, and many more. The machine generated data from different sources can be of different formats and types, and hence, it is very important to parse data in the best format to get the required insight from it.
Splunk supports machine-generated data of various types and structures, and the following screenshot shows the common types of data that comes with an inbuilt support in Splunk Enterprise. The most important point of these sources is that if the data source is from the following list, then the preconfigured settings and configurations already stored in Splunk Enterprise are applied. This helps in getting the data parsed in the best and most suitable formats of events and timestamps to enable faster searching, analytics, and better visualization.
The following screenshot enlists common data sources supported by Splunk Enterprise:
Any format of structured data can be uploaded on Splunk. However, if the data is from any of the preceding formats, then predefined settings and configuration can be applied directly by choosing the respective source type while uploading the data or by configuring it in the inputs.conf file.
The preconfigured settings for any of the preceding structured data is very generic. Many times, it happens that the machine logs are customized structured logs; in that case, additional settings will be required to parse the data.
For example, there are various types of XML. We have listed two types here. In the first type, there is the <note> tag at the start and </note> at the end, and in between, there are parameters are their values. In the second type, there are two levels of hierarchies. XML has the <library> tag along with the <book> tag. Between the <book> and </book> tags, we have parameters and their values.
The first type is as follows:
<note> <to>Jack</to> <from>Micheal</from> <heading>Test XML Format</heading> <body>This is one of the format of XML!</body> </note>
The second type is shown in the following code snippet:
<Library> <book category="Technical"> <title lang="en">Splunk Basic</title> <author>Jack Thomas</author> <year>2007</year> <price>520.00</price> </book> <book category="Story"> <title lang="en">Jungle Book</title> <author>Rudyard Kiplin</author> <year>1984</year> <price>50.50</price> </book> </Library >
Similarly, there can be many types of customized XML scripts generated by machines. To parse different types of structured data, Splunk Enterprise comes with inbuilt settings and configuration defined for the source it comes from. Let's say, for example, that the data received from a web server's logs are also structured logs and it can be in either a JSON, CSV, or simple text format. So, depending on the specific sources, Splunk tries to make the job of the user easier by providing the best settings and configuration for many common sources of data.
Some of the most common sources of data are data from web servers, databases, operation systems, network security, and various other applications and services.
Web and cloud services
The most commonly used web servers are Apache and Microsoft IIS. All Linux-based web services are hosted on Apache servers, and all Windows-based web services on IIS. The logs generated from Linux web servers are simple plain text files, whereas the log files of Microsoft IIS can be in a W3C-extended log file format or it can be stored in a database in the ODBC log file format as well.
Cloud services such as Amazon AWS, S3, and Microsoft Azure can be directly connected and configured according to the forwarded data on Splunk Enterprise. The Splunk app store has many technology add-ons that can be used to create data inputs to send data from cloud services to Splunk Enterprise.
So, when uploading log files from web services, such as Apache, Splunk provides a preconfigured source type that parses data in the best format for it to be available for visualization.
Suppose that the user wants to upload apache error logs on the Splunk server, and then the user chooses apache_error from the Web category of Source type, as shown in the following screenshot:
On choosing this option, the following set of configuration is applied on the data to be uploaded:
- The event break is configured to be on the regular expression pattern ^\[
- The events in the log files will be broken into a single event on occurrence of [ at every start of a line (^)
- The timestamp is to be identified in the [%A %B %d %T %Y] format, where:
- %A is the day of week; for example, Monday
- %B is the month; for example, January
- %d is the day of the month; for example, 1
- %T is the time that has to be in the %H : %M : %S format
- %Y is the year; for example, 2016
- Various other settings such as maxDist that allows the amount of variance of logs can vary from the one specified in the source type and other settings such as category, descriptions, and others.
Any new settings required as per our needs can be added using the New Settings option available in the section below Settings. After making the changes, either the settings can be saved as a new source type or the existing source type can be updated with the new settings.
IT operations and network security
Splunk Enterprise has many applications on the Splunk app store that specifically target IT operations and network security. Splunk is a widely accepted tool for intrusion detection, network and information security, fraud and theft detection, and user behaviour analytics and compliance. A Splunk Enterprise application provides inbuilt support for the Cisco Adaptive Security Appliance (ASA) firewall, Cisco SYSLOG, Call Detail Records (CDR) logs, and one of the most popular intrusion detection application, Snort. The Splunk app store has many technology add-ons to get data from various security devices such as firewall, routers, DMZ, and others. The app store also has the Splunk application that shows graphical insights and analytics over the data uploaded from various IT and security devices.
The Splunk Enterprise application has inbuilt support for databases such as MySQL, Oracle Syslog, and IBM DB2. Apart from this, there are technology add-ons on the Splunk app store to fetch data from the Oracle database and the MySQL database. These technology add-ons can be used to fetch, parse, and upload data from the respective database to the Splunk Enterprise server.
There can be various types of data available from one source; let's take MySQL as an example. There can be error log data, query logging data, MySQL server health and status log data, or MySQL data stored in the form of databases and tables. This concludes that there can be a huge variety of data generated from the same source. Hence, Splunk provides support for all types of data generated from a source. We have inbuilt configuration for MySQL error logs, MySQL slow queries, and MySQL database logs that have been already defined for easier input configuration of data generated from respective sources.
Application and operating system data
The Splunk input source type has inbuilt configuration available for Linux dmesg, syslog, security logs, and various other logs available from the Linux operating system. Apart from the Linux OS, Splunk also provides configuration settings for data input of logs from Windows and iOS systems. It also provides default settings for Log4j-based logging for Java, PHP, and .NET enterprise applications. Splunk also supports lots of other applications' data such as Ruby on Rails, Catalina, WebSphere, and others.
Splunk Enterprise provides predefined configuration for various applications, databases, OSes, and cloud and virtual environments to enrich the respective data with better parsing and breaking into events, thus deriving at better insight from the available data. The applications' source whose settings are not available in Splunk Enterprise can alternatively have apps or add-ons on the app store.
Data input methods
Splunk Enterprise supports data input through numerous methods. Data can be sent on Splunk via files and directories, TCP, UDP, scripts or using universal forwarders.
Files and directories
Splunk Enterprise provides an easy interface to the uploaded data via files and directories. Files can be directly uploaded from the Splunk web interface manually or it can be configured to monitor the file for changes in content, and the new data will be uploaded on Splunk whenever it is written in the file. Splunk can also be configured to upload multiple files by either uploading all the files in one shot or the directory can be monitored for any new files, and the data will get indexed on Splunk whenever it arrives in the directory. Any data format from any sources that are in a human-readable format, that is, no propriety tools are needed to read the data, can be uploaded on Splunk. Splunk Enterprise even supports uploading in a compressed file format such as (.zip and .tar.gz), which has multiple log files in a compressed format.
Splunk supports both TCP and UDP to get data on Splunk from network sources. It can monitor any network port for incoming data and then can index it on Splunk. Generally, in case of data from network sources, it is recommended that you use a Universal forwarder to send data on Splunk, as Universal forwarder buffers the data in case of any issues on the Splunk server to avoid data loss.
Splunk Enterprise provides direct configuration to access data from a Windows system. It supports both local as well as remote collections of various types and sources from a Windows system.
Splunk has predefined input methods and settings to parse event log, performance monitoring report, registry information, hosts, networks and print monitoring of a local as well as remote Windows system.
So, data from different sources of different formats can be sent to Splunk using various input methods as per the requirement and suitability of the data and source. New data inputs can also be created using Splunk apps or technology add-ons available on the Splunk app store.
Adding data to Splunk—new interfaces
Splunk Enterprises introduced new interfaces to accept data that is compatible with constrained resources and lightweight devices for Internet of Things. Splunk Enterprise version 6.3 supports HTTP Event Collector and REST and JSON APIs for data collection on Splunk.
HTTP Event Collector is a very useful interface that can be used to send data without using any forwarder from your existing application to the Splunk Enterprise server. HTTP APIs are available in .NET, Java, Python, and almost all the programming languages. So, forwarding data from your existing application that is based on a specific programming language becomes a cake walk.
Let's take an example, say, you are a developer of an Android application, and you want to know what all features the user uses that are the pain areas or problem-causing screens. You also want to know the usage pattern of your application. So, in the code of your Android application, you can use REST APIs to forward the logging data on the Splunk Enterprise server. The only important point to note here is that the data needs to be sent in a JSON payload envelope. The advantage of using HTTP Event Collector is that without using any third-party tools or any configuration, the data can be sent on Splunk and we can easily derive insights, analytics, and visualizations from it.
HTTP Event Collector and configuration
HTTP Event Collector can be used when you configure it from the Splunk Web console, and the event data from HTTP can be indexed in Splunk using the REST API.
HTTP Event Collector
HTTP Event Collector (EC) provides an API with an endpoint that can be used to send log data from applications into Splunk Enterprise. Splunk HTTP Event Collector supports both HTTP and HTTPS for secure connections.
The following are the features of HTTP Event Collector, which make's adding data on Splunk Enterprise easier:
- It is very lightweight is terms of memory and resource usage, and thus can be used in resources constrained to lightweight devices as well.
- Events can be sent directly from anywhere such as web servers, mobile devices, and IoT without any need of configuration or installation of forwarders.
- It is a token-based JSON API that doesn't require you to save user credentials in the code or in the application settings. The authentication is handled by tokens used in the API.
- It is easy to configure EC from the Splunk Web console, enable HTTP EC, and define the token. After this, you are ready to accept data on Splunk Enterprise.
- It supports both HTTP and HTTPS, and hence it is very secure.
- It supports GZIP compression and batch processing.
- HTTP EC is highly scalable as it can be used in a distributed environment as well as with a load balancer to crunch and index millions of events per second.
In this article, we walked through various data input methods along with various data sources supported by Splunk. We also looked at HTTP Event Collector, which is a new feature added in Splunk 6.3 for data collection via REST to encourage the usage of Splunk for IoT. The data sources and input methods for Splunk are unlike any generic tool and the HTTP Event Collector is the added advantage compare to other data analytics tools.